Statistical Prediction
Unit 12 Review

12.1 Model Selection Criteria and Information Theoretic Approaches

Written by the Fiveable Content Team • Last updated September 2025
Model selection criteria help us choose the best model for our data. They balance how well a model fits the data against how complex it is, which is crucial for avoiding overfitting and for producing predictions that remain accurate on new data.

Information-theoretic approaches like AIC, BIC, and MDL provide ways to compare models. These methods, along with goodness-of-fit measures, help us evaluate and select the most appropriate model for our specific dataset and problem.

Model Selection Criteria

Information-Theoretic Approaches

  • Akaike Information Criterion (AIC) estimates the quality of each model relative to the other candidate models for a given set of data
    • Balances goodness of fit with model complexity
    • Calculated as: $AIC = 2k - 2\ln(L)$, where $k$ is the number of estimated parameters and $L$ is the maximized value of the likelihood function
  • Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models that is closely related to AIC
    • Tends to penalize model complexity more heavily than AIC, since its penalty grows with the sample size
    • Calculated as: $BIC = k\ln(n) - 2\ln(L)$, where $n$ is the number of observations, $k$ is the number of estimated parameters, and $L$ is the maximized value of the likelihood function (see the code sketch after this list)
  • Minimum Description Length (MDL) is a formalization of Occam's razor, where the best model is the one that provides the shortest description of the data
    • Balances model complexity and goodness of fit by minimizing the sum of the description length of the model and the description length of the data given the model
    • Can be used for model selection, feature selection, and dimensionality reduction
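
To make the AIC and BIC formulas concrete, here is a minimal sketch in plain NumPy (the synthetic data and the `gaussian_aic_bic` helper are illustrative, not part of the study guide) that fits ordinary least squares and plugs the maximized Gaussian log-likelihood into both criteria. Whether the error variance counts toward $k$ is a convention; this sketch counts it.

```python
import numpy as np

# Toy data: three candidate predictors, one of which is actually irrelevant
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 2.0 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(scale=1.0, size=n)

def gaussian_aic_bic(X, y):
    """Fit OLS and return (AIC, BIC) from the maximized Gaussian log-likelihood."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])          # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / n                     # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # maximized ln(L)
    k = p + 2                                      # slopes + intercept + variance
    return 2 * k - 2 * loglik, np.log(n) * k - 2 * loglik

print("full model      :", gaussian_aic_bic(X, y))
print("drop predictor 1:", gaussian_aic_bic(np.delete(X, 1, axis=1), y))
```

Because $\ln(n) > 2$ once $n > 7$, BIC charges more per parameter than AIC, which is why it tends to favor smaller models as the sample size grows.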

Goodness-of-Fit Measures

  • Mallows's $C_p$ estimates the bias and prediction error of a model built on a subset of predictors; lower values, close to the number of parameters, indicate a better model
    • Compares the precision and bias of models with subsets of predictors against the full model
    • Calculated as: $C_p = \frac{RSS_p}{s^2} - (n - 2p)$, where $RSS_p$ is the residual sum of squares for the model with $p$ predictors, $s^2$ is the mean squared error of the full model, and $n$ is the number of observations
  • Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a model
    • Increases only if a new term improves the model more than would be expected by chance
    • Calculated as: $\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$, where $n$ is the number of observations and $k$ is the number of predictors (see the sketch after this list)
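
The sketch below computes both measures directly from the formulas above on assumed synthetic data; `rss_ols`, `mallows_cp`, and `adjusted_r2` are illustrative helpers rather than a standard library API.

```python
import numpy as np

def rss_ols(X, y):
    """Residual sum of squares from a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return resid @ resid

def mallows_cp(X_subset, X_full, y):
    """Cp = RSS_p / s^2 - (n - 2p); here p counts the predictors in the subset
    model (conventions differ on whether the intercept is included)."""
    n = len(y)
    s2 = rss_ols(X_full, y) / (n - X_full.shape[1] - 1)  # full-model mean squared error
    p = X_subset.shape[1]
    return rss_ols(X_subset, y) / s2 - (n - 2 * p)

def adjusted_r2(X, y):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1), with k predictors."""
    n, k = X.shape
    r2 = 1 - rss_ols(X, y) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Synthetic data: only the first and third of four predictors actually matter
rng = np.random.default_rng(1)
X_full = rng.normal(size=(120, 4))
y = 1.0 + X_full[:, 0] - 0.5 * X_full[:, 2] + rng.normal(scale=1.0, size=120)

print("Cp, predictors {0, 2}   :", round(mallows_cp(X_full[:, [0, 2]], X_full, y), 2))
print("Cp, all four predictors :", round(mallows_cp(X_full, X_full, y), 2))
print("Adjusted R^2, full model:", round(adjusted_r2(X_full, y), 3))
```

A subset whose $C_p$ lands near its own parameter count, together with a high adjusted $R^2$, is the kind of candidate these measures favor.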

Model Evaluation Techniques

Cross-Validation

  • Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample
    • Involves partitioning the data into subsets, training the model on a subset, and validating the model on the remaining data
    • Common types include k-fold cross-validation, leave-one-out cross-validation, and stratified k-fold cross-validation
  • k-fold cross-validation divides the data into k subsets, trains the model on k-1 subsets, and validates on the remaining subset
    • Repeated k times, with each subset used as the validation set once
    • Provides a more robust estimate of model performance compared to a single train-test split
  • Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation where k equals the number of observations
    • Each observation is used as the validation set once, while the remaining observations form the training set
    • Computationally expensive, but gives an approximately unbiased (though potentially high-variance) estimate of model performance (the sketch after this list runs both k-fold CV and LOOCV)
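
A minimal scikit-learn sketch of both procedures is shown below (assuming scikit-learn is available; the data and the choice of a linear model are illustrative). Reporting every fold's mean squared error makes it clear how averaging over folds gives a steadier estimate than any single train-test split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic regression data for illustration
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 3))
y = 0.5 + X @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.5, size=80)

model = LinearRegression()

# 5-fold CV: each fold is held out once while the model trains on the other four
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_mse = -cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")

# LOOCV: k equals the number of observations, so 80 separate fits are required here
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")

print("5-fold MSE per fold:", np.round(kfold_mse, 3))
print("5-fold mean MSE    :", kfold_mse.mean())
print("LOOCV mean MSE     :", loo_mse.mean())
```

Stratified k-fold (scikit-learn's StratifiedKFold) plays the same role for classification, keeping each fold's class proportions close to those of the full dataset.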

Model Complexity and Performance

Bias-Variance Tradeoff

  • Overfitting occurs when a model learns the noise in the training data to the extent that it negatively impacts the performance of the model on new data
    • Often results from a model that is too complex, such as having too many parameters relative to the number of observations
    • Techniques to mitigate overfitting include regularization, cross-validation, and early stopping
  • Underfitting occurs when a model is too simple to learn the underlying structure of the data
    • Often results in high bias and low variance
    • Can be addressed by increasing model complexity, adding features, or decreasing regularization
  • Bias-variance tradeoff is the balance between the error introduced by the bias (underfitting) and the error introduced by the variance (overfitting)
    • Models with high bias are less complex and may underfit the data, while models with high variance are more complex and may overfit the data
    • The goal is to find the sweet spot where the model is complex enough to learn the underlying structure but not so complex that it learns the noise, as the sketch below illustrates
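
As a rough illustration of the tradeoff, the sketch below (synthetic one-dimensional data and illustrative polynomial degrees; exact numbers depend on the random seed) compares training error with 5-fold cross-validated error as model complexity grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1-D data: a smooth signal plus noise
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Training error keeps falling as the polynomial degree (complexity) grows...
    train_mse = np.mean((model.fit(X, y).predict(X) - y) ** 2)
    # ...but cross-validated error eventually rises once the model starts fitting noise
    cv_mse = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, CV MSE = {cv_mse:.3f}")
```

The degree-1 fit underfits (high bias), while the high-degree fit drives the training error down but lets the cross-validated error climb back up (high variance); the intermediate degree sits near the sweet spot described above.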