Measures of model fit help us gauge how well our linear regression model explains the data. R-squared tells us what percentage of variation in the dependent variable our model accounts for, ranging from 0 to 1.
While R-squared is useful, it has limitations. Enter adjusted R-squared, which penalizes adding unnecessary variables. This helps us avoid overfitting and compare models with different numbers of predictors more accurately.
Coefficient of determination (R-squared)
Definition and interpretation
- R-squared is a statistical measure representing the proportion of variance in the dependent variable predictable from the independent variable(s) in a linear regression model
- Ranges from 0 to 1, with higher values indicating a better fit of the model to the data
- An R-squared of 1 means the model explains all the variability of the response data around its mean
- Interpreted as the percentage of variation in the dependent variable explained by the independent variable(s) in the model
- Also known as the coefficient of determination, commonly used to assess the goodness of fit of a linear regression model
- Formula for R-squared: $R^2 = 1 - \frac{SSR}{SST}$, where SSR is the sum of squared residuals and SST is the total sum of squares
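As a quick illustration of the formula, a minimal sketch in pure Python; the actual and predicted values are hypothetical numbers chosen only to show the arithmetic:

```python
# Toy actual and predicted values (hypothetical numbers for illustration)
y = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.3, 6.9, 9.4, 10.6]

mean_y = sum(y) / len(y)
ssr = sum((a - p) ** 2 for a, p in zip(y, y_pred))   # sum of squared residuals
sst = sum((a - mean_y) ** 2 for a in y)              # total sum of squares
r_squared = 1 - ssr / sst
print(round(r_squared, 4))  # → 0.9885: the predictions track the data closely
```

Because the residuals are small relative to the total spread around the mean, R-squared comes out close to 1.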
Importance and usage
- R-squared provides a quantitative measure of how well the linear regression model fits the observed data
- Helps evaluate the strength of the relationship between the dependent and independent variables
- Allows comparison of different models to determine which one better explains the variability in the data
- Widely used in various fields (economics, social sciences, engineering) to assess the explanatory power of linear regression models
Calculating R-squared
Required components
- To calculate R-squared, you need the sum of squared residuals (SSR) and the total sum of squares (SST) from the linear regression model
- SSR is the sum of the squared differences between the predicted values and the actual values of the dependent variable
- Represents the amount of variation in the dependent variable not explained by the model
- SST is the sum of the squared differences between the actual values of the dependent variable and its mean
- Represents the total variation in the dependent variable
Calculation methods
- Once you have SSR and SST, use the formula $R^2 = 1 - \frac{SSR}{SST}$ to calculate R-squared
- Alternatively, most statistical software packages (SPSS, R) and programming languages (Python) provide functions to directly compute R-squared for a given linear regression model
- Example in R: `summary(lm_model)$r.squared` returns the R-squared value for the fitted linear model `lm_model`
- Example in Python with scikit-learn: `from sklearn.metrics import r2_score; r2_score(y_true, y_pred)` calculates R-squared given the true values (`y_true`) and predicted values (`y_pred`)
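Putting the Python route end to end, a small sketch (assuming scikit-learn and NumPy are installed; the data are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical data: y is roughly linear in x with a little noise
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# model.score and r2_score compute the same quantity
print(model.score(X, y))
print(r2_score(y, y_pred))
```

Both calls return the same R-squared, so either route works once the model's predictions are in hand.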
R-squared limitations vs adjusted R-squared
Limitations of R-squared
- R-squared increases as more independent variables are added to the model, even if those variables do not have a significant impact on the dependent variable
- This can lead to the inclusion of irrelevant variables and overfitting
- Does not indicate whether the independent variables are statistically significant or if the model is appropriate for the data
- Only measures the goodness of fit without considering the model's validity
- Does not consider the number of independent variables in the model, potentially leading to overfitting if too many variables are included
Adjusted R-squared as an alternative
- Adjusted R-squared addresses the limitations of R-squared by adjusting for the number of independent variables in the model
- Penalizes the addition of unnecessary independent variables, providing a more reliable measure of the model's goodness of fit
- Particularly useful when comparing models with different numbers of independent variables
- Helps determine if adding more variables truly improves the model's explanatory power
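The penalty can be seen directly by refitting a model with an irrelevant predictor added. A sketch using NumPy's least squares (all data simulated for illustration): R-squared can only go up when a variable is added, while adjusted R-squared pays for the extra variable.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_r2(X, y):
    """Fit OLS with an intercept; return (R-squared, adjusted R-squared)."""
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])            # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    ssr = np.sum((y - X1 @ beta) ** 2)               # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)                # total sum of squares
    r2 = 1 - ssr / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

n = 30
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)          # y depends only on x
noise = rng.normal(size=n)                           # irrelevant predictor

r2_one, adj_one = fit_r2(x.reshape(-1, 1), y)
r2_two, adj_two = fit_r2(np.column_stack([x, noise]), y)

print(r2_two >= r2_one)   # True: R-squared never drops when a variable is added
```

Adjusted R-squared, by contrast, improves only if the new variable reduces the residuals enough to outweigh the $(n-1)/(n-k-1)$ penalty.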
Adjusted R-squared interpretation
Calculation and formula
- Adjusted R-squared is calculated using the formula $R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$, where n is the number of observations and k is the number of independent variables in the model
- The adjusted R-squared value will always be less than or equal to the R-squared value
- Decreases when the number of independent variables increases without a corresponding improvement in the model's fit
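In code, the adjustment is one line; the numbers below are hypothetical, chosen only to show how the penalty grows with k at a fixed R-squared:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R-squared of 0.85 from 50 observations, with fewer or more predictors
print(round(adjusted_r2(0.85, n=50, k=3), 4))    # → 0.8402 (small penalty)
print(round(adjusted_r2(0.85, n=50, k=10), 4))   # → 0.8115 (larger penalty)
```

With k held against n, the more predictors a model uses to reach the same R-squared, the lower its adjusted R-squared falls.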
Interpretation and comparison
- The interpretation of adjusted R-squared is similar to R-squared
- Represents the proportion of variance in the dependent variable predictable from the independent variable(s), adjusted for the number of variables in the model
- A higher adjusted R-squared value indicates a better fit of the model to the data, considering the number of independent variables used
- When comparing models with different numbers of independent variables, adjusted R-squared is a more appropriate measure than R-squared
- Helps identify the model that strikes a balance between explanatory power and parsimony (using fewer variables)