Assessing normality and homoscedasticity is crucial for validating linear regression models. These assumptions underpin the standard errors, hypothesis tests, and confidence intervals reported for a fitted model. Violations can lead to unreliable standard errors, misleading p-values, and incorrect conclusions about relationships between variables.
Graphical methods and statistical tests help detect assumption violations. Residual plots, normality tests, and heteroscedasticity checks guide researchers in identifying issues. Understanding these diagnostics is essential for making informed decisions about model validity and potential remedial measures.
Normality and Homoscedasticity Assumptions
Understanding the Assumptions
- Normality assumption states that the residuals (errors) of a linear regression model should follow a normal distribution with a mean of zero
- Homoscedasticity assumption requires that the variance of the residuals is constant across all levels of the independent variable(s)
- Homoscedasticity implies that the spread of the residuals should be consistent, without any systematic patterns or changes in variance
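- Violations of these assumptions do not by themselves bias the least squares coefficient estimates, but heteroscedasticity makes the estimates inefficient and biases the usual standard errors, while non-normality undermines exact small-sample tests and confidence intervals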
Implications of Violated Assumptions
- Non-normality of residuals can affect the validity of hypothesis tests and confidence intervals, as they rely on the assumption of normally distributed errors
- Example: If the residuals are heavily skewed or have outliers, the t-tests and confidence intervals for the regression coefficients may be unreliable
- Heteroscedasticity (non-constant variance) can result in incorrect standard errors and p-values, leading to invalid inferences about the significance of the regression coefficients
- Example: If the variance of the residuals increases with higher values of the independent variable, the standard errors may be underestimated, resulting in overly optimistic p-values and potentially false conclusions about the significance of the coefficients
Assessing Residual Normality
Graphical Methods
- Visual inspection of residual plots can provide insights into the normality assumption
- Histogram of residuals should exhibit a bell-shaped, symmetric distribution around zero
- Normal probability plot (Q-Q plot) of residuals should show points close to a straight diagonal line if the residuals are normally distributed
- Example: If the Q-Q plot shows a systematic departure from the diagonal line, such as an S-shaped pattern, it suggests non-normality of the residuals
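The Q-Q check described above can be sketched numerically. The following is a minimal illustration, assuming simulated data and a single predictor; the helper names `ols_residuals` and `qq_points` are hypothetical and introduced only for this example. It uses `scipy.stats.probplot` to obtain the coordinates of the Q-Q plot together with the correlation between the ordered residuals and the theoretical normal quantiles, which is close to 1 when the points lie near a straight line.

```python
import numpy as np
from scipy import stats


def ols_residuals(x, y):
    """Fit y = b0 + b1*x by least squares and return the residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def qq_points(residuals):
    """Return theoretical quantiles, ordered residuals, and the Q-Q correlation."""
    (theoretical, ordered), (_slope, _intercept, r) = stats.probplot(
        residuals, dist="norm"
    )
    return theoretical, ordered, r


# Simulated data (an assumption for illustration, not from the text).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y_normal = 1.0 + 2.0 * x + rng.normal(0, 1, size=300)       # normal errors
y_skewed = 1.0 + 2.0 * x + rng.exponential(1.0, size=300)   # right-skewed errors

_, _, r_normal = qq_points(ols_residuals(x, y_normal))
_, _, r_skewed = qq_points(ols_residuals(x, y_skewed))
print(f"Q-Q correlation, normal errors: {r_normal:.3f}")
print(f"Q-Q correlation, skewed errors: {r_skewed:.3f}")
```

The `theoretical` and `ordered` arrays can be passed to any plotting library to draw the Q-Q plot itself; the correlation is only a rough numeric companion to the visual check, not a replacement for it.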
Statistical Tests
- Shapiro-Wilk test is a commonly used statistical test for assessing normality of residuals
- The null hypothesis of the Shapiro-Wilk test is that the residuals are normally distributed
- A small p-value (typically < 0.05) indicates a deviation from normality, while a large p-value suggests the residuals are consistent with a normal distribution
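- Kolmogorov-Smirnov test is another statistical test for normality, comparing the empirical cumulative distribution function of the residuals to the theoretical normal distribution; because the mean and variance are estimated from the same residuals, the standard critical values are too conservative, and the Lilliefors correction is normally used instead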
- Skewness and kurtosis measures can also be used to assess the symmetry and heaviness of the tails of the residual distribution, respectively
- Example: A skewness value close to zero indicates a symmetric distribution, while a positive or negative skewness suggests right or left skewness, respectively
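As a rough sketch of how these numeric checks might be combined, the example below runs the Shapiro-Wilk test and computes skewness and excess kurtosis with SciPy. The function name `normality_summary`, the 0.05 threshold, and the simulated right-skewed residuals are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def normality_summary(residuals, alpha=0.05):
    """Shapiro-Wilk test plus skewness and excess kurtosis of the residuals."""
    w_stat, p_value = stats.shapiro(residuals)
    return {
        "shapiro_W": float(w_stat),
        "shapiro_p": float(p_value),
        "skewness": float(stats.skew(residuals)),
        # scipy reports excess kurtosis by default (0 for a normal distribution)
        "excess_kurtosis": float(stats.kurtosis(residuals)),
        "reject_normality": bool(p_value < alpha),
    }


# Simulated right-skewed "residuals", centered near zero (illustrative only).
rng = np.random.default_rng(1)
skewed_residuals = rng.exponential(1.0, size=200) - 1.0

summary = normality_summary(skewed_residuals)
for name, value in summary.items():
    print(name, value)
```

For residuals like these, a positive skewness together with a small Shapiro-Wilk p-value points to right skew rather than, say, heavy symmetric tails, which helps in choosing a remedy.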
Detecting Heteroscedasticity
Residual Plots
- Residual plots can reveal patterns of heteroscedasticity
- Plot the residuals against the fitted (predicted) values of the dependent variable; under homoscedasticity the points form a horizontal band of roughly constant width around zero
- If the spread of residuals increases or decreases systematically with the predicted values, it indicates the presence of heteroscedasticity
- Example: If the residual plot shows a fan-shaped pattern, with the spread of residuals increasing as the predicted values increase, it suggests heteroscedasticity
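The fan shape described above can also be summarized numerically. The sketch below, under the assumption of simulated data whose error standard deviation grows with the predictor, sorts the residuals by fitted value, splits them into equal-sized groups, and reports the residual standard deviation in each group. The helper `spread_by_fitted` is a hypothetical name used only here.

```python
import numpy as np


def spread_by_fitted(fitted, residuals, n_bins=4):
    """Residual standard deviation within equal-sized groups of sorted fitted values."""
    order = np.argsort(fitted)
    groups = np.array_split(residuals[order], n_bins)
    return np.array([g.std(ddof=1) for g in groups])


# Simulated heteroscedastic data: the error SD is proportional to x (an assumption).
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=400)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 * x)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

spreads = spread_by_fitted(fitted, residuals)
print("Residual SD from lowest to highest fitted values:", np.round(spreads, 2))
```

A steadily increasing sequence of standard deviations corresponds to the fan-shaped residual plot; roughly equal values would be consistent with constant variance.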
Statistical Tests
- Breusch-Pagan test is a statistical test for detecting heteroscedasticity
- The null hypothesis is that the variance of the residuals is constant (homoscedasticity)
- A small p-value (typically < 0.05) suggests the presence of heteroscedasticity
- White's test is another statistical test for heteroscedasticity that does not assume a specific form of heteroscedasticity
- Goldfeld-Quandt test compares the variance of residuals between two subsamples of the data, typically split based on the values of an independent variable suspected to cause heteroscedasticity
- Example: If the variance of residuals is significantly different between the subsamples, it indicates heteroscedasticity related to that independent variable
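To make the logic of the Breusch-Pagan test concrete, here is a minimal hand-rolled sketch of its studentized (Koenker) variant: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The function name `breusch_pagan` and the simulated data are assumptions for illustration; in practice a statistics library's implementation would normally be preferred.

```python
import numpy as np
from scipy import stats


def breusch_pagan(X, residuals):
    """Studentized Breusch-Pagan test. X must include an intercept column."""
    n, k = X.shape
    e2 = residuals ** 2
    # Auxiliary regression of squared residuals on the predictors.
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    aux_fitted = X @ gamma
    r2 = 1.0 - np.sum((e2 - aux_fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = n * r2                               # LM statistic
    p_value = stats.chi2.sf(lm, df=k - 1)     # df = number of non-constant predictors
    return lm, p_value


# Simulated data with error SD proportional to x (illustrative assumption).
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=500)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 * x)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

lm, p_value = breusch_pagan(X, residuals)
print(f"LM = {lm:.2f}, p-value = {p_value:.3g}")
```

A small p-value here rejects the null hypothesis of constant variance. The test only looks for variance that changes linearly with the predictors, which is why White's test, with its squares and cross-products, can detect forms this version misses.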
Consequences of Violated Assumptions
Impact on Coefficient Estimates and Inferences
- Violation of the normality assumption does not bias the least squares coefficient estimates, which remain unbiased as long as the errors have mean zero; its main effect is on inference
- Non-normal residuals can affect the validity of hypothesis tests and confidence intervals, particularly in small samples; in large samples the central limit theorem makes the usual t-tests approximately valid
- In severe cases of non-normality, such as heavy-tailed errors, the least squares estimators may not be the most efficient or appropriate
- Heteroscedasticity can result in inefficient estimates of regression coefficients and biased standard errors
- The standard errors of the regression coefficients may be underestimated or overestimated, affecting the reliability of hypothesis tests and confidence intervals
- Heteroscedasticity can lead to incorrect conclusions about the significance of the independent variables
Remedial Measures
- Because violations undermine the reliability of the model's inferences, a detected violation should be addressed before the results are interpreted
- Remedial measures, such as data transformations or robust regression techniques, may be necessary to address violations of normality and homoscedasticity assumptions
- Example: Applying a logarithmic transformation to the dependent variable can sometimes help stabilize the variance and improve normality of residuals
- Example: Using weighted least squares regression, or keeping the ordinary least squares estimates but reporting heteroscedasticity-consistent (Huber-White) standard errors, can account for heteroscedasticity and provide more reliable inferences
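The Huber-White idea can be sketched directly with matrix algebra. The example below is a minimal illustration, assuming simulated data with strongly increasing error variance; it computes the ordinary least squares coefficients, the classical standard errors, and the simplest heteroscedasticity-consistent variant (often labeled HC0). The function name `ols_with_standard_errors` is hypothetical.

```python
import numpy as np


def ols_with_standard_errors(X, y):
    """OLS coefficients with classical and Huber-White (HC0) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta

    # Classical standard errors assume one common error variance.
    sigma2 = e @ e / (n - k)
    classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # HC0 "sandwich": each observation contributes its own squared residual.
    meat = X.T @ (X * (e ** 2)[:, None])
    robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, classical, robust


# Simulated data: error SD grows roughly with x squared (illustrative assumption).
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=2000)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1 + 0.2 * x ** 2)

X = np.column_stack([np.ones_like(x), x])
beta, classical_se, robust_se = ols_with_standard_errors(X, y)
print("coefficients:        ", np.round(beta, 3))
print("classical std errors:", np.round(classical_se, 3))
print("HC0 std errors:      ", np.round(robust_se, 3))
```

Note that the coefficients are identical under both approaches; only the standard errors change. Weighted least squares, by contrast, changes the estimates themselves and requires a model for how the variance depends on the predictors.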