Fiveable

🤖Statistical Prediction Unit 14 Review


14.1 Regression Metrics: MSE, RMSE, MAE, and R-squared

Written by the Fiveable Content Team • Last updated September 2025

Regression metrics help us gauge how well our models predict outcomes. MSE, RMSE, and MAE measure prediction errors, while R-squared shows how much variation our model explains. These tools are crucial for evaluating and comparing regression models.

Understanding these metrics is key to assessing model performance in real-world scenarios. They help us identify which models are most accurate and reliable, guiding us in making better predictions and decisions based on our data.

Error Metrics

Measuring Prediction Errors

  • Mean Squared Error (MSE) calculates the average squared difference between the predicted and actual values
    • Formula: $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
    • Squaring the errors gives proportionally more weight to large errors (and less weight to errors smaller than 1)
    • Sensitive to outliers due to the squaring of errors
  • Root Mean Squared Error (RMSE) takes the square root of the MSE to bring the units back to the original scale
    • Formula: $RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
    • Easier to interpret than MSE as it is in the same units as the target variable
    • Still sensitive to outliers, but less so than MSE
  • Mean Absolute Error (MAE) calculates the average absolute difference between the predicted and actual values
    • Formula: $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
    • Less sensitive to outliers compared to MSE and RMSE
    • Provides a more intuitive understanding of the average error magnitude
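The three error metrics above can be sketched directly from their formulas. This is a minimal NumPy illustration with made-up data (the function names and sample values are only for demonstration; libraries such as scikit-learn provide equivalent built-in functions):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of MSE, in the target's units."""
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

# Illustrative data: residuals are 0.5, 0.0, -1.5, -1.0
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

print(mse(y_true, y_pred))   # average of squared residuals
print(rmse(y_true, y_pred))  # back in the original units
print(mae(y_true, y_pred))   # average absolute residual
```

Note how the single large residual (-1.5) pulls MSE and RMSE up more than MAE, reflecting the outlier sensitivity described above.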

Percentage-based Error Metric

  • Mean Absolute Percentage Error (MAPE) expresses the average absolute error as a percentage of the actual values
    • Formula: $MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$
    • Useful when the target variable has a wide range of values or when comparing models across different datasets
    • Can be misleading when actual values are close to zero, as it can lead to large percentage errors
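A minimal sketch of MAPE, again with illustrative data (note it divides by the actual values, so it is undefined wherever $y_i = 0$):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error; undefined where y_true == 0."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Each prediction is off by exactly 10% of its actual value
y_true = [100.0, 200.0, 50.0]
y_pred = [110.0, 180.0, 55.0]
print(mape(y_true, y_pred))  # -> 10.0
```

Because the errors are measured as percentages, the same absolute error of 5 units would dominate the metric if an actual value were, say, 1 instead of 50, which is the near-zero pitfall noted above.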

Analyzing Model Residuals

  • Residuals represent the differences between the actual and predicted values
    • Formula: $residual_i = y_i - \hat{y}_i$
    • Positive residuals indicate underestimation, while negative residuals indicate overestimation
    • Analyzing residuals helps assess model assumptions and identify patterns or biases in the predictions
    • Residual plots (residuals vs. predicted values) can reveal non-linear relationships or heteroscedasticity
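The sign convention for residuals can be made concrete with a small example (the data here are illustrative; in practice the residuals would come from a fitted model):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.0])

# residual = actual - predicted
residuals = y_true - y_pred

# Positive residual -> the model underestimated the actual value;
# negative residual -> the model overestimated it.
for yt, yp, r in zip(y_true, y_pred, residuals):
    label = "under" if r > 0 else ("over" if r < 0 else "exact")
    print(f"actual={yt}, predicted={yp}, residual={r:+.1f} ({label}estimate)")
```

Plotting these residuals against the predicted values (e.g., with matplotlib) is the usual way to spot the non-linearity or heteroscedasticity mentioned above: a healthy plot shows no pattern, just a random band around zero.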

Coefficient of Determination

Measuring Model Fit

  • R-squared (Coefficient of Determination) measures the proportion of variance in the target variable explained by the model
    • Formula: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
    • Ranges from 0 to 1 for a linear model with an intercept fit to the training data, with higher values indicating a better fit (it can be negative when a model is evaluated on new data and fits worse than simply predicting the mean)
    • Represents the improvement of the model compared to using the mean of the target variable as a prediction
    • Can be interpreted as the percentage of variance explained by the model (e.g., R-squared of 0.75 means 75% of the variance is explained)
  • Adjusted R-squared penalizes the addition of unnecessary predictors to the model
    • Formula: $R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$, where $p$ is the number of predictors
    • Useful for comparing models with different numbers of predictors
    • Prevents overfitting by discouraging the inclusion of irrelevant variables
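Both formulas translate directly into code. This is a minimal NumPy sketch with illustrative data (a simple model with $p = 1$ predictor is assumed for the adjusted version):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, p):
    """Penalizes R^2 for the number of predictors p."""
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Illustrative data: predictions track the actual values closely
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.3, 6.9, 9.4, 10.6]

print(r_squared(y_true, y_pred))                 # close to 1: good fit
print(adjusted_r_squared(y_true, y_pred, p=1))   # slightly lower, as expected
```

The adjusted value is always at most the plain $R^2$, and the gap widens as more predictors are added without a corresponding drop in $SS_{res}$.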

Assessing Model Goodness of Fit

  • Goodness of fit refers to how well the model fits the observed data
    • A high R-squared or adjusted R-squared indicates a good fit, meaning the model captures a significant portion of the variability in the target variable
    • However, a high R-squared does not necessarily imply a good model, as it can be affected by outliers or overfitting
    • It is important to consider other diagnostic measures (residual plots, cross-validation) alongside R-squared to assess model performance and validity
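The caveat that a high R-squared can mask overfitting is easy to demonstrate. In this sketch (synthetic data and a NumPy polynomial fit, used purely for illustration), a degree-9 polynomial achieves a higher training R-squared than a simple line, even though the true relationship is linear:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.2, size=x.size)  # truly linear + noise

# Simple holdout split: even indices train, odd indices test
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

for degree in (1, 9):
    coefs = np.polyfit(x_tr, y_tr, degree)
    r2_train = r2(y_tr, np.polyval(coefs, x_tr))
    r2_test = r2(y_te, np.polyval(coefs, x_te))
    print(f"degree {degree}: train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

The degree-9 model nearly interpolates the 10 training points, so its training R-squared is close to 1, but its held-out R-squared reveals the worse generalization. This is why cross-validation or a holdout set should accompany R-squared when judging a model.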