Regression diagnostics and model selection are crucial for ensuring the validity and reliability of your regression analysis. These techniques help you assess model assumptions, identify potential issues, and choose the best predictors for your model.
By examining residuals, testing for heteroscedasticity, and using methods like cross-validation, you can improve your model's accuracy. Understanding these tools will make you a more effective data analyst and help you avoid common pitfalls in regression analysis.
Assumptions of Linear Regression
Linearity and Independence
- Linear regression models assume a linear relationship between the dependent variable and independent variables
- Departures from linearity can be detected using residual plots (residuals vs. fitted values, residuals vs. each independent variable)
- Independence assumes that observations are independent of each other
- Violations of independence can be detected using the Durbin-Watson test or by examining residual plots for patterns (systematic trends, clusters)
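A minimal sketch of a residuals-vs-fitted plot using statsmodels and matplotlib; the DataFrame `df` and the column names `y`, `x1`, and `x2` are hypothetical placeholders:

```python
# Sketch: residuals vs. fitted values for an OLS fit.
# Assumes a pandas DataFrame `df` with columns 'y', 'x1', 'x2' (hypothetical names).
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

model = smf.ols("y ~ x1 + x2", data=df).fit()

plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```

A curved band in this plot suggests nonlinearity, while a funnel shape suggests non-constant variance.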
Homoscedasticity and Normality
- Homoscedasticity assumes constant variance of residuals across all levels of the independent variables
- Heteroscedasticity (non-constant variance) can be identified using residual plots or statistical tests like the Breusch-Pagan test
- Normality assumes that residuals follow a normal distribution
- Normality can be checked using histograms (bell-shaped), Q-Q plots (straight line), or statistical tests like the Shapiro-Wilk test
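A minimal sketch of the graphical normality checks, assuming `model` is the fitted statsmodels OLS result from the earlier sketch:

```python
# Sketch: histogram and Q-Q plot of residuals as graphical normality checks.
import matplotlib.pyplot as plt
import scipy.stats as stats

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of residuals: roughly bell-shaped if normality holds
axes[0].hist(model.resid, bins=30, edgecolor="black")
axes[0].set_title("Histogram of residuals")

# Q-Q plot: points close to the reference line suggest normality
stats.probplot(model.resid, dist="norm", plot=axes[1])
axes[1].set_title("Q-Q plot of residuals")

plt.tight_layout()
plt.show()
```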
Multicollinearity
- Multicollinearity occurs when independent variables are highly correlated, which can affect the interpretation and stability of regression coefficients
- Multicollinearity can be assessed using correlation matrices or variance inflation factors (VIF)
- VIF values greater than 5 or 10 indicate high multicollinearity, which may require dropping or combining correlated predictors, using regularization (e.g., ridge regression), or centering variables when the collinearity comes from interaction or polynomial terms (see the sketch below)
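A minimal sketch of computing VIFs with statsmodels, assuming `X` is a pandas DataFrame containing only the predictor columns:

```python
# Sketch: variance inflation factors for a set of predictors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)  # include an intercept column
vif = pd.DataFrame({
    "variable": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])],
})
print(vif)  # ignore the 'const' row; VIF > 5-10 flags problematic collinearity
```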
Regression Model Diagnostics
Residual Analysis
- Residual plots, such as residuals vs. fitted values and residuals vs. each independent variable, can help identify violations of linearity, independence, and homoscedasticity
- Patterns in residual plots (curved shapes, systematic trends) suggest violations of assumptions
- The Durbin-Watson test is used to detect autocorrelation in residuals, which violates the independence assumption
- The test statistic ranges from 0 to 4, with values close to 2 indicating no autocorrelation
- Values substantially below 2 indicate positive autocorrelation, while values substantially above 2 indicate negative autocorrelation
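A minimal sketch of the Durbin-Watson check, again assuming `model` is a fitted statsmodels OLS result:

```python
# Sketch: Durbin-Watson statistic for residual autocorrelation.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw:.3f}")
# ~2: no autocorrelation; well below 2: positive; well above 2: negative
```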
Heteroscedasticity and Normality Tests
- The Breusch-Pagan test is used to detect heteroscedasticity
- It tests the null hypothesis that the variance of residuals is constant across all levels of the independent variables
- Rejecting the null hypothesis suggests the presence of heteroscedasticity
- Histograms and Q-Q plots of residuals can help assess the normality assumption
- A bell-shaped histogram and a straight line in the Q-Q plot suggest normality
- The Shapiro-Wilk test is a formal statistical test for normality
- It tests the null hypothesis that the residuals are normally distributed
- Rejecting the null hypothesis indicates a departure from normality
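A minimal sketch of both formal tests, assuming `model` is a fitted statsmodels OLS result:

```python
# Sketch: formal tests for heteroscedasticity and non-normal residuals.
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import shapiro

# Breusch-Pagan: H0 = residual variance is constant
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")

# Shapiro-Wilk: H0 = residuals are normally distributed
sw_stat, sw_pvalue = shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {sw_pvalue:.4f}")
```

Small p-values lead to rejecting the respective null hypothesis, pointing to heteroscedasticity or non-normal residuals.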
Model Selection Techniques
Subset Selection Methods
- Model selection involves choosing the best subset of predictors that balance model fit and complexity
- Forward selection starts with an empty model and iteratively adds the most significant predictor until no significant improvement in model fit is achieved
- Backward elimination starts with a full model containing all predictors and iteratively removes the least significant predictor until all remaining predictors are significant
- Stepwise selection combines forward selection and backward elimination, allowing predictors to be added or removed at each step based on their significance
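A minimal sketch of p-value-based forward selection; the DataFrame `df`, the response `y`, and the candidate predictor names are hypothetical, and real implementations often use AIC or cross-validation error rather than p-values as the selection criterion:

```python
# Sketch: forward selection, adding the most significant predictor at each step.
import statsmodels.formula.api as smf

candidates = ["x1", "x2", "x3", "x4"]   # hypothetical predictor names
selected, alpha = [], 0.05

while candidates:
    # p-value of each remaining candidate when added to the current model
    pvals = {}
    for var in candidates:
        formula = "y ~ " + " + ".join(selected + [var])
        fit = smf.ols(formula, data=df).fit()
        pvals[var] = fit.pvalues[var]
    best = min(pvals, key=pvals.get)
    if pvals[best] < alpha:             # add only if it is significant
        selected.append(best)
        candidates.remove(best)
    else:
        break                           # stop when no candidate improves the model

print("Selected predictors:", selected)
```

Backward elimination works the same way in reverse, repeatedly dropping the least significant predictor from the full model.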
Information Criteria and Adjusted R-squared
- Information criteria, such as Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), provide a trade-off between model fit and complexity
- Lower values of AIC and BIC indicate better models
- AIC and BIC penalize model complexity, favoring simpler models with fewer predictors
- Adjusted R-squared adjusts the regular R-squared for the number of predictors in the model, penalizing complex models
- Higher adjusted R-squared values are preferred
- Adjusted R-squared helps prevent overfitting by considering the number of predictors
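A minimal sketch of comparing candidate models by these criteria, assuming a DataFrame `df` with a response `y` and hypothetical predictors `x1` through `x3`:

```python
# Sketch: comparing candidate models by AIC, BIC, and adjusted R-squared.
import statsmodels.formula.api as smf

formulas = ["y ~ x1", "y ~ x1 + x2", "y ~ x1 + x2 + x3"]
for f in formulas:
    fit = smf.ols(f, data=df).fit()
    print(f"{f:20s}  AIC={fit.aic:8.1f}  BIC={fit.bic:8.1f}  "
          f"adj R2={fit.rsquared_adj:.3f}")
# Prefer lower AIC/BIC and higher adjusted R-squared
```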
Cross-Validation
- Cross-validation techniques, such as k-fold cross-validation, assess the predictive performance of models on unseen data
- The data is divided into k subsets (folds), and the model is trained and evaluated k times, each time using a different fold as the validation set
- Cross-validation helps to prevent overfitting and provides a more reliable estimate of model performance
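A minimal sketch of 5-fold cross-validation with scikit-learn, assuming a predictor DataFrame `X` and response `y`:

```python
# Sketch: 5-fold cross-validation of a linear regression model.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=cv, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
print("Mean RMSE:", -scores.mean())
```

Each fold serves once as the validation set, so the averaged error reflects performance on data the model was not trained on.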
Limitations of Regression Analysis
Causality and Outliers
- Regression analysis quantifies the association between the dependent variable and independent variables, but it cannot by itself establish causality
- Causal inferences require additional assumptions and experimental designs (randomized controlled trials)
- Regression models are sensitive to outliers, which can have a disproportionate influence on the estimated coefficients and model fit
- Outliers should be carefully examined and handled appropriately (justified removal, transformation, or robust regression methods; see the sketch below)
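A minimal sketch comparing ordinary least squares to a robust fit (Huber M-estimator) with statsmodels, assuming a predictor DataFrame `X` and response `y`:

```python
# Sketch: robust regression to reduce the influence of outliers.
import statsmodels.api as sm

X_const = sm.add_constant(X)
ols_fit = sm.OLS(y, X_const).fit()
robust_fit = sm.RLM(y, X_const, M=sm.robust.norms.HuberT()).fit()

# Large differences between the two sets of coefficients suggest
# that outliers are pulling the OLS estimates.
print(ols_fit.params)
print(robust_fit.params)
```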
Extrapolation and Omitted Variables
- Extrapolation beyond the range of observed data can lead to unreliable predictions
- Regression models are best used for interpolation within the range of the data; predictions made far outside that range can be inaccurate and misleading
- Omitted variable bias occurs when important predictors are not included in the model, leading to biased estimates of the included predictors' effects
- Omitted variables can confound the relationship between the dependent variable and included predictors
- Careful consideration of potential confounding variables and subject matter expertise is crucial
Measurement Errors and Model Assumptions
- Measurement errors in the independent variables can lead to biased and inconsistent estimates of the regression coefficients, known as attenuation bias
- Measurement errors can attenuate (weaken) the estimated relationships between variables
- Reliable and valid measurement of variables is essential for accurate regression results
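A small hypothetical simulation illustrating attenuation bias: the true slope is 2.0, but adding measurement error to the predictor pulls the estimated slope toward zero:

```python
# Sketch: simulating attenuation bias from measurement error in a predictor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
x_true = rng.normal(size=n)
y = 2.0 * x_true + rng.normal(scale=1.0, size=n)   # true slope is 2.0

x_noisy = x_true + rng.normal(scale=1.0, size=n)   # predictor measured with error

slope_true = sm.OLS(y, sm.add_constant(x_true)).fit().params[1]
slope_noisy = sm.OLS(y, sm.add_constant(x_noisy)).fit().params[1]
print(f"slope with true x:  {slope_true:.2f}")   # close to 2.0
print(f"slope with noisy x: {slope_noisy:.2f}")  # attenuated toward ~1.0
```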
- Regression models assume that the relationship between the dependent variable and independent variables remains constant over time
- Changes in the underlying relationship can affect the validity of the model
- Model validation and updating may be necessary to ensure the model remains relevant and accurate over time