🎲 Data Science Statistics Unit 17 Review

17.2 Regularization Techniques (Lasso, Ridge)

Written by the Fiveable Content Team • Last updated September 2025
Regularization techniques like Lasso and Ridge help prevent overfitting in statistical models. By adding penalties to the loss function, these methods shrink coefficients, promoting simpler models that generalize better to new data.

L1 (Lasso) and L2 (Ridge) regularization differ in their effects. Lasso can drive coefficients to exactly zero, aiding feature selection, while Ridge shrinks coefficients toward zero without eliminating them. Both techniques trade a small amount of bias for lower variance, which typically improves predictions on new data.

Regularization Techniques

L1 and L2 Regularization

  • L1 regularization (Lasso) adds the absolute values of the coefficients to the loss function
    • Promotes sparsity by driving some coefficients to exactly zero
    • Useful for feature selection
    • Mathematically expressed as $\text{Loss} + \lambda \sum_{i=1}^{n} |\beta_i|$
  • L2 regularization (Ridge) adds the squared magnitudes of the coefficients to the loss function
    • Shrinks coefficients towards zero but rarely to exactly zero
    • Effective for handling multicollinearity
    • Mathematically expressed as $\text{Loss} + \lambda \sum_{i=1}^{n} \beta_i^2$
  • Regularization parameter (λ) controls the strength of regularization
    • Larger λ values increase the regularization effect
    • Smaller λ values decrease the regularization effect
    • Optimal λ often determined through cross-validation (see the sketch after this list)
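A minimal sketch of these two penalties in practice, assuming scikit-learn and synthetic data: `LassoCV` and `RidgeCV` choose λ (exposed as `alpha`) by cross-validation, and the Lasso fit typically zeroes out several coefficients while the Ridge fit shrinks them without eliminating any.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 20 features, only 5 truly informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# LassoCV and RidgeCV pick the regularization strength (alpha, i.e. λ)
# by cross-validation; features are standardized first so the penalty
# treats all coefficients on the same scale.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)).fit(X, y)
ridge = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5)).fit(X, y)

print("Lasso: chosen alpha =", lasso[-1].alpha_,
      "| zero coefficients:", np.sum(lasso[-1].coef_ == 0))
print("Ridge: chosen alpha =", ridge[-1].alpha_,
      "| zero coefficients:", np.sum(ridge[-1].coef_ == 0))
```

Standardizing before fitting matters here because both penalties act on coefficient magnitudes, which depend on the scale of each feature.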

Advanced Regularization Methods

  • Elastic Net combines L1 and L2 regularization
    • Balances feature selection and coefficient shrinkage
    • Mathematically expressed as $\text{Loss} + \lambda_1 \sum_{i=1}^{n} |\beta_i| + \lambda_2 \sum_{i=1}^{n} \beta_i^2$
    • Useful when dealing with correlated predictors (see the sketch after this list)
  • Shrinkage reduces magnitude of model coefficients
    • Helps prevent overfitting by constraining model complexity
    • L1 and L2 regularization both induce shrinkage
  • Sparsity refers to models with few non-zero coefficients
    • L1 regularization promotes sparsity
    • Leads to simpler, more interpretable models
    • Useful in high-dimensional settings (gene expression data)
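A hedged sketch of Elastic Net on a synthetic high-dimensional problem, assuming scikit-learn's `ElasticNetCV`: it tunes both the L1/L2 mix (`l1_ratio`) and the overall strength (`alpha`) by cross-validation, and the resulting model is usually sparse while still spreading weight across correlated predictors.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# High-dimensional setting: more features (200) than samples (80),
# with groups of correlated predictors (low effective rank).
X, y = make_regression(n_samples=80, n_features=200, n_informative=10,
                       effective_rank=20, noise=5.0, random_state=0)

# ElasticNetCV searches over the L1/L2 mix (l1_ratio) and alpha jointly.
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 0.99], cv=5,
                 max_iter=5000, random_state=0),
).fit(X, y)

model = enet[-1]
print("Chosen l1_ratio:", model.l1_ratio_)
print("Chosen alpha:", model.alpha_)
print("Non-zero coefficients:", np.sum(model.coef_ != 0), "of", X.shape[1])
```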

Model Evaluation

Understanding Model Fit

  • Overfitting occurs when model learns noise in training data
    • Results in poor generalization to new, unseen data
    • Characterized by low training error but high test error
    • Can be addressed through regularization or by increasing training data (see the sketch after this list)
  • Underfitting happens when model is too simple to capture underlying patterns
    • Results in poor performance on both training and test data
    • Characterized by high bias
    • Can be addressed by increasing model complexity or adding features
  • Bias-variance tradeoff balances model simplicity and complexity
    • Bias measures systematic error due to model assumptions
    • Variance measures model sensitivity to fluctuations in training data
    • Total error = $\text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
    • Optimal model minimizes total error
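To make the overfitting discussion concrete, here is a small illustration (assuming scikit-learn and a synthetic noisy sine curve, not any particular real dataset): a degree-15 polynomial fit by ordinary least squares reaches a very low training error but a much higher test error, while the same features with an L2 penalty generalize better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # noisy sine curve

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

def errors(model):
    """Return (training MSE, test MSE) for a fitted copy of the model."""
    model.fit(X_train, y_train)
    return (mean_squared_error(y_train, model.predict(X_train)),
            mean_squared_error(y_test, model.predict(X_test)))

# Degree-15 polynomial with ordinary least squares: low training error,
# much higher test error (overfitting).
ols = make_pipeline(PolynomialFeatures(15), StandardScaler(), LinearRegression())
# Same features with an L2 penalty: training error rises slightly,
# test error usually drops (better generalization).
ridge = make_pipeline(PolynomialFeatures(15), StandardScaler(), Ridge(alpha=1.0))

print("OLS   train/test MSE:", errors(ols))
print("Ridge train/test MSE:", errors(ridge))
```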

Cross-validation Techniques

  • Cross-validation assesses model performance on unseen data
    • Helps detect overfitting and estimate generalization error (see the sketch after this list)
    • K-fold cross-validation divides data into K subsets
      • Train on K-1 subsets, validate on remaining subset
      • Repeat K times, rotating validation set
    • Leave-one-out cross-validation uses single observation for validation
      • Computationally expensive but useful for small datasets
    • Stratified cross-validation maintains class proportions in each fold
      • Useful for imbalanced datasets
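A brief sketch of K-fold versus stratified K-fold cross-validation with scikit-learn on a synthetic imbalanced classification problem; the exact scores depend on the random seed, but stratification keeps the class balance roughly constant across folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Imbalanced binary classification problem (about 10% positives).
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

model = LogisticRegression(max_iter=1000)

# Plain K-fold: folds may end up with very different class proportions.
kf_scores = cross_val_score(model, X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
# Stratified K-fold: each fold keeps roughly the original class balance.
skf_scores = cross_val_score(model, X, y,
                             cv=StratifiedKFold(n_splits=5, shuffle=True,
                                                random_state=0))

print("K-fold accuracy:           ", kf_scores.round(3), "mean:", kf_scores.mean().round(3))
print("Stratified K-fold accuracy:", skf_scores.round(3), "mean:", skf_scores.mean().round(3))
```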

Feature Selection and Regression

Feature Selection Techniques

  • Feature selection identifies most relevant predictors
    • Improves model interpretability and reduces overfitting
    • Can be performed using wrapper, filter, or embedded methods (see the sketch after this list)
  • Wrapper methods use model performance to select features
    • Forward selection starts with no features, adds one at a time
    • Backward elimination starts with all features, removes one at a time
    • Recursive feature elimination iteratively removes least important features
  • Filter methods use statistical measures to select features
    • Correlation-based selection chooses features highly correlated with target
    • Mutual information quantifies dependency between feature and target
    • Variance threshold removes features with low variance
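A short sketch contrasting a wrapper method (recursive feature elimination) with a filter method (mutual information), assuming scikit-learn and synthetic regression data; the selected feature indices are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

# Wrapper method: recursive feature elimination with a linear model,
# keeping the 5 features the model ranks as most important.
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
print("RFE-selected features:        ", list(rfe.get_support(indices=True)))

# Filter method: rank features by mutual information with the target
# and keep the top 5, without fitting a predictive model at all.
mi = SelectKBest(mutual_info_regression, k=5).fit(X, y)
print("Mutual-info-selected features:", list(mi.get_support(indices=True)))
```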

Regularized Linear Regression

  • Regularized linear regression incorporates penalties into model fitting
    • Lasso regression uses L1 regularization
    • Ridge regression uses L2 regularization
    • Elastic Net combines L1 and L2 regularization
  • Coefficient paths visualize how coefficients change with regularization strength (see the sketch after this list)
    • X-axis represents the regularization parameter (λ)
    • Y-axis shows coefficient values
    • Lasso paths can reach exactly zero, indicating feature elimination
    • Ridge paths asymptotically approach zero but never reach it
  • Regularized regression models implemented in various libraries
    • Scikit-learn provides Lasso, Ridge, and ElasticNet classes
    • Statsmodels offers OLS with regularization options
    • Regularization strength typically tuned using cross-validation
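Rather than plotting full coefficient paths, a quick sketch (assuming scikit-learn and synthetic data) that makes the same point numerically: as the regularization strength grows, Lasso drives more coefficients to exactly zero, while Ridge coefficients shrink but stay non-zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Refit Lasso and Ridge over a grid of regularization strengths and
# count how many coefficients are exactly zero at each value.
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    lasso_zeros = np.sum(Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_ == 0)
    ridge_zeros = np.sum(Ridge(alpha=alpha).fit(X, y).coef_ == 0)
    print(f"alpha={alpha:>6}: Lasso zero coefs={lasso_zeros}, Ridge zero coefs={ridge_zeros}")
```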