Fiveable

Probability and Statistics Unit 7 Review

7.2 Measures of dispersion


Written by the Fiveable Content Team • Last updated September 2025

Measures of dispersion quantify how spread out data points are in a dataset. These tools help statisticians understand variability and distribution, providing crucial insights into data patterns and outliers.

Variance, standard deviation, range, and interquartile range are key dispersion measures. Each offers unique perspectives on data spread, with some being more robust to outliers than others. Understanding these measures is essential for effective data analysis and interpretation.

Variance and standard deviation

  • Variance and standard deviation quantify the spread or dispersion of a dataset around its mean
  • These measures are essential for understanding the variability and distribution of data in probability and statistics

Population vs sample variance

  • Population variance $\sigma^2$ represents the average squared deviation from the mean for an entire population
  • Sample variance $s^2$ estimates the population variance using a subset of data (sample)
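    • Uses $n-1$ in the denominator (Bessel's correction) because deviations measured from the sample mean $\bar{x}$ are, on average, smaller than deviations from the true mean $\mu$, so dividing by $n$ would underestimate $\sigma^2$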
  • Formulas:
    • Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
    • Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Calculating variance

  • Calculate the mean of the dataset
  • Subtract the mean from each data point and square the result
  • Sum the squared differences and divide by the number of data points (or $n-1$ for sample variance)
  • Example:
    • Dataset: 4, 7, 9, 12, 18
    • Mean: $\bar{x} = \frac{4+7+9+12+18}{5} = 10$
    • Squared differences: $(4-10)^2 = 36$, $(7-10)^2 = 9$, $(9-10)^2 = 1$, $(12-10)^2 = 4$, $(18-10)^2 = 64$
    • Sample variance: $s^2 = \frac{36+9+1+4+64}{4} = 28.5$
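
The steps above can be sketched in a few lines of Python. This is a minimal illustration using the example dataset; the variable names are chosen here for readability and are not part of the original guide.

```python
import statistics

data = [4, 7, 9, 12, 18]

# Step 1: mean of the dataset
mean = sum(data) / len(data)

# Step 2: squared deviation of each point from the mean
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: divide the sum by n - 1 (sample) or by N (population)
sample_var = sum(squared_diffs) / (len(data) - 1)
population_var = sum(squared_diffs) / len(data)

print(sample_var)      # 28.5
print(population_var)  # 22.8

# The standard library offers the same two calculations
library_sample_var = statistics.variance(data)
library_population_var = statistics.pvariance(data)
```

Treating the same five values as a whole population gives the smaller value 22.8, which shows how much the choice of denominator matters for small datasets.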

Standard deviation from variance

  • Standard deviation is the square root of the variance
  • Measures the typical distance of data points from the mean (strictly, the root-mean-square deviation rather than the plain average distance)
  • Has the same units as the original data, making it more interpretable than variance
  • Formulas:
    • Population standard deviation: $\sigma = \sqrt{\sigma^2}$
    • Sample standard deviation: $s = \sqrt{s^2}$
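
As a quick check of the two formulas, the sketch below takes square roots of the variances for the dataset 4, 7, 9, 12, 18 used elsewhere in this guide. It relies only on Python's standard `statistics` and `math` modules.

```python
import math
import statistics

data = [4, 7, 9, 12, 18]

# s = sqrt(s^2): sample standard deviation
sample_sd = math.sqrt(statistics.variance(data))

# sigma = sqrt(sigma^2): population standard deviation
population_sd = math.sqrt(statistics.pvariance(data))

print(round(sample_sd, 2))      # 5.34
print(round(population_sd, 2))  # 4.77
```

`statistics.stdev` and `statistics.pstdev` return the same two values directly.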

Interpreting standard deviation

  • A low standard deviation indicates data points are clustered closely around the mean (less dispersion)
  • A high standard deviation suggests data points are spread out over a wider range (more dispersion)
  • Approximately 68% of data falls within one standard deviation of the mean in a normal distribution (empirical rule)
  • Comparing standard deviations allows for assessing the relative variability of different datasets
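
The 68% figure can be checked against the normal distribution itself. The sketch below uses `statistics.NormalDist` from the standard library; it also prints the shares for two and three standard deviations, which complete the empirical rule (about 95% and 99.7%).

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Share of a normal distribution lying within k standard deviations of the mean
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} sd: {share:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```

These percentages hold only for normally distributed data; other shapes can deviate from them substantially.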

Range and interquartile range

  • Range and interquartile range (IQR) are measures of dispersion that do not rely on the mean
  • The IQR is less sensitive to outliers than variance and standard deviation; the range depends only on the two most extreme values, so it is highly sensitive to them

Calculating range

  • Range is the difference between the maximum and minimum values in a dataset
  • Provides a simple measure of the total spread of the data
  • Formula: $\text{Range} = \max(x) - \min(x)$
  • Example:
    • Dataset: 4, 7, 9, 12, 18
    • Range: $18 - 4 = 14$

Percentiles and quartiles

  • Percentiles divide a dataset into 100 equal parts
  • Quartiles divide a dataset into four equal parts (Q1, Q2 or median, Q3)
  • Calculating quartiles:
    • First, arrange the data in ascending order
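    • Q1 is the median of the lower half of the data (the values between the minimum and the median)
    • Q3 is the median of the upper half of the data (the values between the median and the maximum)
    • Conventions differ on whether the median itself belongs to each half and on how to interpolate, so textbooks and software can report slightly different quartiles for small datasets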

Interquartile range (IQR)

  • IQR is the range of the middle 50% of the data
  • Calculated as the difference between the third quartile (Q3) and the first quartile (Q1)
  • Formula: $IQR = Q3 - Q1$
  • Example:
    • Dataset: 4, 7, 9, 12, 18
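    • Q1: 7, Q3: 12 (medians of the lower half 4, 7, 9 and the upper half 9, 12, 18, counting the median 9 in both halves)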
    • IQR: $12 - 7 = 5$
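
The range and IQR for this dataset can be reproduced with Python's standard library. This is a sketch under one assumption: `method="inclusive"` is chosen because it matches the quartiles quoted in the example; the library's default (`method="exclusive"`) follows a different convention and returns 5.5 and 15 for the same data.

```python
import statistics

data = [4, 7, 9, 12, 18]

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Quartiles: n=4 asks for the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range)  # 14
print(q1, q2, q3)  # 7.0 9.0 12.0
print(iqr)         # 5.0
```

When comparing results with a textbook or another tool, it is worth checking which quartile convention each one uses before concluding that a value is wrong.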

Outliers and IQR

  • IQR can be used to identify potential outliers in a dataset
  • Outliers are data points that fall below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$
  • Example:
    • Dataset: 4, 7, 9, 12, 18
    • IQR: 5
    • Lower fence: $7 - 1.5 \times 5 = -0.5$
    • Upper fence: $12 + 1.5 \times 5 = 19.5$
    • No outliers in this dataset
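
The fence rule translates directly into code. The helper below, `iqr_outliers`, is a hypothetical name introduced for illustration, and it assumes the inclusive quartile convention so that it agrees with the example. The second call adds a made-up value of 40 to show what a flagged point looks like.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

# Dataset from the example: nothing falls outside the fences
print(iqr_outliers([4, 7, 9, 12, 18]))      # (-0.5, 19.5, [])

# Same data plus a hypothetical extreme value of 40
print(iqr_outliers([4, 7, 9, 12, 18, 40]))  # (-6.0, 30.0, [40])
```

Note that adding a value changes the quartiles and therefore the fences, so the fences should be recomputed whenever the dataset changes.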

Coefficient of variation

  • Coefficient of variation (CV) is a standardized measure of dispersion
  • Useful for comparing the relative variability of datasets with different units or means

Relative variability

  • CV expresses the standard deviation as a percentage of the mean
  • Allows for comparing the variability of datasets with different scales
  • A higher CV indicates greater relative variability

Calculating coefficient of variation

  • Formula: $CV = \frac{s}{\bar{x}} \times 100\%$
    • $s$ is the sample standard deviation
    • $\bar{x}$ is the sample mean
  • Example:
    • Dataset: 4, 7, 9, 12, 18
    • Sample mean: $\bar{x} = 10$
    • Sample standard deviation: $s \approx 5.34$
    • CV: $\frac{5.34}{10} \times 100\% \approx 53.4\%$
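
A minimal sketch of the same calculation in Python follows. The function name `coefficient_of_variation` is an assumption made for this example, and the guard against non-positive means is a design choice reflecting the fact that the ratio is hard to interpret there.

```python
import statistics

def coefficient_of_variation(values):
    """Sample CV in percent: 100 * s / mean, for data with a positive mean."""
    mean = statistics.mean(values)
    if mean <= 0:
        raise ValueError("CV is not meaningful for a zero or negative mean")
    return statistics.stdev(values) / mean * 100

cv = coefficient_of_variation([4, 7, 9, 12, 18])
print(round(cv, 1))  # 53.4

# Rescaling the data (for example, changing units) leaves the CV unchanged
cv_scaled = coefficient_of_variation([40, 70, 90, 120, 180])
```

Because both the standard deviation and the mean scale by the same factor, `cv_scaled` matches `cv` up to floating-point rounding.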

Comparing distributions with CV

  • When comparing datasets, a higher CV indicates greater relative variability
  • CV is dimensionless, allowing for comparison of variability across different types of data
  • Limitations:
    • Not suitable for datasets with means close to zero or negative means
    • Sensitive to small changes in the mean when the mean is close to zero

Mean absolute deviation

  • Mean absolute deviation (MAD) is another measure of dispersion; it shares its abbreviation with the median absolute deviation, which is a different statistic
  • Based on the absolute differences between each data point and the mean

Calculating MAD

  • Calculate the mean of the dataset
  • Compute the absolute difference between each data point and the mean
  • Sum the absolute differences and divide by the number of data points
  • Formula: $MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
  • Example:
    • Dataset: 4, 7, 9, 12, 18
    • Mean: $\bar{x} = 10$
    • Absolute differences: $|4-10| = 6$, $|7-10| = 3$, $|9-10| = 1$, $|12-10| = 2$, $|18-10| = 8$
    • MAD: $\frac{6+3+1+2+8}{5} = 4$
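
The three steps map onto three lines of Python. This sketch uses plain built-ins, and the variable names are chosen here for illustration.

```python
data = [4, 7, 9, 12, 18]

# Step 1: mean of the dataset
mean = sum(data) / len(data)

# Step 2: absolute difference between each point and the mean
abs_diffs = [abs(x - mean) for x in data]

# Step 3: average of the absolute differences
mean_abs_dev = sum(abs_diffs) / len(data)

print(abs_diffs)     # [6.0, 3.0, 1.0, 2.0, 8.0]
print(mean_abs_dev)  # 4.0
```

The result, 4, is smaller than the sample standard deviation of about 5.34 for the same data, since absolute differences are not inflated by squaring.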

MAD vs standard deviation

  • The mean absolute deviation is less sensitive to outliers than the standard deviation
  • Standard deviation squares the differences, giving more weight to larger deviations
  • Because it weights deviations linearly, MAD may be preferred when a dataset contains outliers, although it still depends on the mean and is not fully robust
  • However, standard deviation is more mathematically tractable and widely used in statistical analysis

Chebyshev's inequality

  • Chebyshev's inequality provides a bound on the proportion of data within a certain number of standard deviations from the mean
  • Applicable to any dataset or distribution with a finite mean and variance, regardless of its shape

Proportion of data within k standard deviations

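  • Chebyshev's inequality states that at least $1 - \frac{1}{k^2}$ of the data falls within $k$ standard deviations of the mean
  • Formula: $P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$, or equivalently $P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2}$; the inequality holds for any $k > 0$ but is informative only when $k > 1$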
  • Example:
    • For $k = 2$, at least $1 - \frac{1}{2^2} = 0.75$ or 75% of the data falls within 2 standard deviations of the mean
    • For $k = 3$, at least $1 - \frac{1}{3^2} \approx 0.89$ or 89% of the data falls within 3 standard deviations of the mean
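
To see the bound in action on non-normal data, the sketch below compares it with the observed share in a simulated sample. The exponential sample, the seed, and the helper name `chebyshev_lower_bound` are all assumptions made for this illustration.

```python
import random
import statistics

def chebyshev_lower_bound(k):
    """Minimum share of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k**2

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed sample drawn from an exponential distribution
sample = [random.expovariate(1.0) for _ in range(10_000)]
mu = statistics.fmean(sample)
sigma = statistics.pstdev(sample)  # population SD of the sample itself

observed = {}
for k in (2, 3):
    within = sum(abs(x - mu) < k * sigma for x in sample) / len(sample)
    observed[k] = within
    print(f"k={k}: bound {chebyshev_lower_bound(k):.3f}, observed {within:.3f}")
```

The observed shares are typically well above the bound, which illustrates why Chebyshev's inequality is described as conservative: it guarantees a minimum, not a close estimate.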

Applications of Chebyshev's inequality

  • Provides a conservative bound on the proportion of data within a certain range
  • Useful when the underlying distribution is unknown or non-normal
  • Helps in identifying potential outliers or anomalies in a dataset
  • Used in various fields, such as finance (value at risk) and quality control (process monitoring)

Dispersion in non-normal distributions

  • Measures of dispersion can behave differently in non-normal distributions
  • Skewness and kurtosis are important factors to consider when analyzing dispersion in these cases

Skewness and dispersion

  • Skewness measures the asymmetry of a distribution
  • Positive skew: tail on the right side is longer or fatter (right-skewed)
    • Typically Mean > Median > Mode (a rule of thumb that can fail, especially for discrete or multimodal data)
  • Negative skew: tail on the left side is longer or fatter (left-skewed)
    • Typically Mode > Median > Mean
  • In skewed distributions, a single symmetric measure like the standard deviation cannot show that the spread differs on the two sides of the center

Kurtosis and dispersion

  • Kurtosis measures the heaviness of the tails of a distribution; it is often described in terms of peakedness as well, but tail weight is what it mainly reflects
  • Leptokurtic: heavy tails, often with a sharp peak (positive excess kurtosis)
    • More outliers and higher dispersion in the tails compared to a normal distribution
  • Platykurtic: light tails, often with a flatter peak (negative excess kurtosis)
    • Fewer outliers and lower dispersion in the tails compared to a normal distribution
  • Mesokurtic: tails like those of a normal distribution (kurtosis = 3, so excess kurtosis = 0)
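
Skewness and excess kurtosis can both be computed from central moments. The sketch below uses the simple moment-based (population) definitions, with excess kurtosis defined as kurtosis minus 3 so that a normal distribution scores 0. Statistical packages often apply small-sample corrections, so their values may differ slightly; the function name is an assumption made here.

```python
import statistics

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3
    return skew, excess_kurt

skew, excess_kurt = skewness_and_excess_kurtosis([4, 7, 9, 12, 18])
print(round(skew, 2), round(excess_kurt, 2))
```

For the dataset 4, 7, 9, 12, 18 the skewness comes out positive (the value 18 stretches the right side) and the excess kurtosis negative, though with only five points neither estimate is reliable.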

Robust measures of dispersion

  • In the presence of skewness, heavy tails, or outliers, robust measures of dispersion are preferred
  • Median absolute deviation (also abbreviated MAD, but distinct from the mean absolute deviation) is a robust alternative to standard deviation
    • Calculates the median of the absolute deviations from the median
    • Less sensitive to outliers and skewness
  • Interquartile range (IQR) is another robust measure
    • Focuses on the middle 50% of the data, ignoring the tails
  • Trimmed or Winsorized standard deviation
    • Trimming removes a fixed percentage of the highest and lowest values; Winsorizing replaces them with the nearest remaining values; the standard deviation is then calculated on the adjusted data
    • Reduces the impact of outliers on the dispersion measure
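
To illustrate what robustness means in practice, the sketch below compares the standard deviation, the IQR, and the median absolute deviation on the guide's dataset and on a copy where 18 is replaced by a hypothetical extreme value of 180. The helper names are assumptions, the median absolute deviation is left unscaled, and the inclusive quartile convention is assumed.

```python
import statistics

def median_abs_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

def iqr(values):
    """Interquartile range using the inclusive quartile convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [4, 7, 9, 12, 18]
contaminated = [4, 7, 9, 12, 180]  # hypothetical outlier replaces 18

for label, values in (("clean", clean), ("contaminated", contaminated)):
    print(
        label,
        round(statistics.stdev(values), 2),
        iqr(values),
        median_abs_deviation(values),
    )
```

The standard deviation increases by more than a factor of ten, while the IQR stays at 5 and the median absolute deviation stays at 3, because neither depends on how far the largest value sits from the rest.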