📊 Honors Statistics Unit 12 Review

12.5 Outliers

Written by the Fiveable Content Team • Last updated September 2025

Outliers can significantly impact statistical analyses, potentially skewing results and leading to incorrect conclusions. Identifying and handling these data points is crucial for accurate interpretation of your data.

There are several methods to detect outliers, including the standard deviation rule and examining residuals. Removing outliers can dramatically change regression lines and correlation coefficients, highlighting the importance of careful consideration in data analysis.

Outliers

Outliers using the standard deviation rule

  • Data points significantly different from the rest of the data can substantially impact statistical analyses and should be carefully examined
  • Common method for identifying potential outliers by calculating the mean and standard deviation of the dataset
  • Any data point falling more than two standard deviations away from the mean is considered a potential outlier (z-score greater than 2 or less than -2)
    • Lower boundary calculated as $\text{mean} - 2 \times \text{standard deviation}$
    • Upper boundary calculated as $\text{mean} + 2 \times \text{standard deviation}$
  • Data points outside these boundaries should be investigated further to determine whether they are true outliers (for example, measurement errors) or merely unusual but valid observations (for example, rare events); a short code sketch applying the rule follows this list
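
To make the rule concrete, here is a minimal Python sketch of the two-standard-deviation check (the dataset is made up for illustration, and NumPy is assumed to be available):

```python
import numpy as np

# Hypothetical dataset used only for illustration
data = np.array([12, 15, 14, 13, 16, 15, 14, 42, 13, 15])

mean = data.mean()
std = data.std(ddof=1)  # sample standard deviation

# Boundaries from the rule: mean +/- 2 standard deviations
lower = mean - 2 * std
upper = mean + 2 * std

# Equivalent z-score check: |z| > 2 flags a potential outlier
z_scores = (data - mean) / std
potential_outliers = data[np.abs(z_scores) > 2]

print(f"Boundaries: ({lower:.2f}, {upper:.2f})")
print("Potential outliers:", potential_outliers)
```

Keep in mind that a large outlier inflates the mean and standard deviation used to build these boundaries, which is one more reason flagged points should be investigated rather than deleted automatically.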

Effects of outlier removal

  • Outliers can significantly impact the regression line (the line that best fits the data by minimizing the sum of squared residuals) and the correlation coefficient (a measure of the strength of the linear relationship between two variables, ranging from -1 to 1)
  • Removing an outlier can change the slope and intercept of the regression line
    • Outlier far from the rest of the data points can "pull" the regression line towards itself
    • Removing the outlier can result in a regression line better representing the majority of the data
  • Correlation coefficient can be affected by the presence of outliers
    • Outliers can artificially strengthen the correlation (when an outlier follows the direction of the overall trend) or weaken it (when an outlier runs counter to that trend)
    • Removing an outlier can lead to a correlation coefficient that more accurately reflects the relationship between the variables for the majority of the data, as the sketch after this list shows
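
The sketch below illustrates this effect by fitting a regression line with and without a single outlier; the data values are invented for the example, and NumPy's polyfit and corrcoef handle the line and the correlation coefficient:

```python
import numpy as np

# Hypothetical data with a roughly linear trend plus one outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.9, 18.2, 5.0])  # last point is the outlier

def fit_and_correlate(x, y):
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares regression line
    r = np.corrcoef(x, y)[0, 1]                 # correlation coefficient
    return slope, intercept, r

print("With outlier:    slope=%.2f intercept=%.2f r=%.3f" % fit_and_correlate(x, y))
print("Without outlier: slope=%.2f intercept=%.2f r=%.3f" % fit_and_correlate(x[:-1], y[:-1]))
```

With these made-up numbers, the lone point at x = 10 flattens the slope and noticeably lowers r; dropping it restores a near-perfect linear fit, which is the "pulling" effect described above.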

Standard deviation of residuals

  • Residuals are differences between observed values and predicted values from a regression line, calculated as $\text{residual} = \text{observed value} - \text{predicted value}$
  • Standard deviation of residuals can be used to identify potential outliers
    1. Calculate predicted values for each data point using the regression equation
    2. Compute residuals by subtracting predicted values from observed values
    3. Calculate standard deviation of the residuals
  • Any data point with a residual more than two standard deviations away from zero is considered a potential outlier
    • Lower boundary calculated as $-2 \times \text{standard deviation of residuals}$
    • Upper boundary calculated as $2 \times \text{standard deviation of residuals}$
  • This method helps identify outliers based on their deviation from the regression line rather than from the mean of the dataset, as illustrated in the sketch below
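
Here is a minimal sketch of the three-step procedure above, again with invented data; it fits the line with NumPy, computes the residuals, and flags any point whose residual falls outside the ±2 residual-standard-deviation boundaries:

```python
import numpy as np

# Hypothetical (x, y) data; one y value deviates sharply from the linear trend
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([3.0, 5.1, 6.9, 9.2, 11.0, 21.5, 15.1, 16.8, 19.2, 21.1])

# Step 1: predicted values from the fitted regression line
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# Step 2: residuals = observed - predicted
residuals = y - predicted

# Step 3: standard deviation of the residuals, then the +/- 2 SD boundaries
s_resid = residuals.std(ddof=1)
flags = np.abs(residuals) > 2 * s_resid

for xi, yi, ri, flag in zip(x, y, residuals, flags):
    print(f"x={xi:2d}  y={yi:5.1f}  residual={ri:6.2f}  potential outlier: {flag}")
```

Because the residuals from a least-squares fit with an intercept average to zero, the boundaries are centered at zero, exactly as stated above.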

Additional outlier detection methods

  • Interquartile range (IQR) method: Identifies outliers based on the spread of the middle 50% of the data
    • Outliers are typically defined as data points below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$ (applied in the code sketch at the end of this section)
    • This method is often visualized using a boxplot, which displays the median, quartiles, and potential outliers
  • Leverage: Measures the influence of a data point on the regression line based on its distance from the mean of the predictor variable
  • Influential points: Data points that have a disproportionate effect on the regression results
    • Cook's distance is a measure used to identify influential points by quantifying the change in regression coefficients when an observation is excluded
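
The IQR rule is straightforward to apply by hand or in code; the sketch below uses NumPy's percentile function on a made-up dataset with the conventional 1.5 multiplier:

```python
import numpy as np

# Hypothetical dataset for illustration
data = np.array([12, 15, 14, 13, 16, 15, 14, 42, 13, 15])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: ({lower_fence}, {upper_fence})")
print("Outliers by the IQR rule:", outliers)
```

Leverage and Cook's distance are usually obtained from regression software rather than computed by hand; for example, the statsmodels library reports Cook's distance as part of its influence diagnostics for a fitted regression model.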