*** HOW TO CHOOSE THE BIN SIZE OF A HISTOGRAM? ***

[NB: Implemented in NAPA header] :

=======================
Scott rule
=======================
From http://www.fmrib.ox.ac.uk/analysis/techrep/tr00mj2/tr00mj2/node24.html

It has been shown [Scott, 1979] that the optimal histogram bin size, which provides
the most efficient, unbiased estimation of the probability density function,
is achieved when:

    Bin size W = 3.49 S / N**(1/3)

where W is the width of the histogram bin, S is the standard deviation of the distribution
and N is the number of available samples.
In practice, the estimated standard deviation, s, must be used in place of S.

..................................................................................................................................

[NB: NOT yet implemented in NAPA header] :

A similar, but more robust, result was also obtained by Freedman and Diaconis
(summarised in [Izenman, 1991]), which gives the bin width as:

========================
Freedman-Diaconis rule
========================
From Wikipedia, the free encyclopedia

In statistics, the Freedman-Diaconis rule is used to select the width of the bins
to be used in a histogram (and hence, indirectly, the number of bins).
The choice of bin width controls how much the histogram smooths the data.
The general equation for the rule is:

    Bin size = 2.0 IQR / N**(1/3)

where IQR is the interquartile range of the data and N is the number of observations in the sample.

Reference:
Freedman D and Diaconis P (1981). On the histogram as a density estimator: L2 theory.
Probability Theory and Related Fields. 57(4): 453-476

..................................................................................................................................

[NB: NOT yet implemented in NAPA header] :

========================
Interquartile range
========================
From Wikipedia, the free encyclopedia

In descriptive statistics, the interquartile range (IQR) is the difference between
the third and first quartiles and is a measure of statistical dispersion.
The interquartile range is a more stable statistic than the range, and is often preferred to that statistic.

Since 25% of the data are less than or equal to the first quartile and 25% are greater than
or equal to the third quartile, the difference is the length of an interval that includes
about half of the data. This difference should be measured in the same units as the data.

The interquartile range is used to build box plots, which give a simple graphical
representation of a probability distribution.

Example

     i    x[i]
     1    102
     2    104
     3    105   ---- the first quartile, Q1 = 105
     4    107
     5    108
     6    109   ---- the second quartile, Q2 or median = 109
     7    110
     8    112
     9    115   ---- the third quartile, Q3 = 115
    10    115
    11    118

From this table, the interquartile range is Q3 - Q1 = 115 - 105 = 10.

The interquartile range, not to be confused with the interquartile mean
(a measure of central tendency), is a measure of statistical dispersion.

For normally distributed data:

    IQR = 1.35 Sigma (approximately)

..................................................................................................................................

[NB: NOT yet implemented in NAPA header] :

========================
Quartile
========================
From Wikipedia, the free encyclopedia

In descriptive statistics, a quartile is any of the three values which divide the sorted data set
into four equal parts, so that each part represents one quarter of the sample or population.

Thus:

    first quartile  (designated Q1) = lower quartile = cuts off lowest 25% of data = 25th percentile
    second quartile (designated Q2) = median = cuts data set in half = 50th percentile
    third quartile  (designated Q3) = upper quartile = cuts off highest 25% of data, or lowest 75% = 75th percentile

The difference between the upper and lower quartiles is called the interquartile range.
Example 1:

    Data Set:         6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36
    Ordered Data Set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49

    Q1 = 15
    Q2 = 40
    Q3 = 43

Example 2:

    Ordered Data Set: 7, 15, 36, 39, 40, 41

    Q1 = 15
    Q2 = (39+36)/2 = 37.5
    Q3 = 40

..................................................................................................................................
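
Sketch (Python, NOT the NAPA header code):

The quartile examples and the two bin-size rules in these notes can be tried out with
a few lines of Python. This is only a minimal sketch under stated assumptions: the
function names (quartiles, iqr, scott_bin_size, freedman_diaconis_bin_size) are
made up for illustration, and the quartiles are computed with the "median of each half"
convention, which appears to be the one used in the examples (for an odd sample size the
median itself is left out of both halves). Other quartile conventions exist and can give
slightly different values.

```python
import statistics


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumption: for an odd number of samples the median is excluded from
    both the lower and the upper half.
    """
    x = sorted(data)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    half = n // 2
    lower = x[:half]               # values below the median position
    upper = x[half + (n % 2):]     # values above the median position
    return statistics.median(lower), statistics.median(x), statistics.median(upper)


def iqr(data):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(data)
    return q3 - q1


def scott_bin_size(data):
    """Scott rule: 3.49 s / N**(1/3), with s the estimated standard deviation."""
    s = statistics.stdev(data)     # sample standard deviation
    return 3.49 * s / len(data) ** (1.0 / 3.0)


def freedman_diaconis_bin_size(data):
    """Freedman-Diaconis rule: 2.0 IQR / N**(1/3)."""
    return 2.0 * iqr(data) / len(data) ** (1.0 / 3.0)


if __name__ == "__main__":
    example_1 = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]
    print(quartiles(example_1))    # (15, 40, 43), as in Example 1
    print("Scott bin size:            ", scott_bin_size(example_1))
    print("Freedman-Diaconis bin size:", freedman_diaconis_bin_size(example_1))
```

Both rules return a bin width in the same units as the data; a number of bins, if needed,
would follow by dividing the data range by that width and rounding up.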