Public Policy Analysis Unit 11 Review

11.2 Regression Analysis and Modeling

Written by the Fiveable Content Team • Last updated September 2025

Regression analysis is a powerful tool in policy analysis, helping us understand relationships between variables. It allows us to predict outcomes and measure the impact of different factors on policy issues. From simple linear models to complex logistic regressions, these techniques offer valuable insights.

Understanding variables, coefficients, and model evaluation metrics is crucial for interpreting regression results. We'll look at common issues like multicollinearity and heteroscedasticity, and learn how to address them to ensure our analyses are accurate and reliable.

Regression Models

Linear Regression

  • Simple linear regression models the relationship between two variables using a straight line
  • Assumes a linear relationship exists between the dependent variable and a single independent variable
  • Equation takes the form $y = \beta_0 + \beta_1x + \varepsilon$, where $y$ is the dependent variable, $x$ is the independent variable, $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error term
  • Can predict values of the dependent variable based on the independent variable (housing prices based on square footage)
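
A minimal sketch of fitting such a model in Python with statsmodels; the housing data here are synthetic and purely illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
sqft = rng.uniform(800, 3000, size=100)                        # independent variable x
price = 50_000 + 120 * sqft + rng.normal(0, 20_000, size=100)  # y = b0 + b1*x + noise

X = sm.add_constant(sqft)              # prepends a column of 1s for the intercept
model = sm.OLS(price, X).fit()         # ordinary least squares fit
print(model.params)                    # estimates of [beta_0, beta_1]
print(model.predict([[1.0, 1500.0]]))  # predicted price for a 1,500 sq ft house
```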

Multiple Regression

  • Extends linear regression to include multiple independent variables
  • Models the relationship between the dependent variable and two or more independent variables
  • Equation takes the form $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \varepsilon$, where $y$ is the dependent variable, $x_1, x_2, ..., x_n$ are the independent variables, $\beta_0$ is the y-intercept, $\beta_1, \beta_2, ..., \beta_n$ are the coefficients, and $\varepsilon$ is the error term
  • Allows for more complex relationships and interactions between variables to be captured (predicting salary based on education level, years of experience, and job title)
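
The same workflow extends directly to several predictors. A sketch using synthetic salary data (the variable names and coefficients are illustrative assumptions, not real estimates):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
education = rng.integers(12, 21, size=n)    # years of schooling
experience = rng.uniform(0, 30, size=n)     # years of experience
salary = (20_000 + 2_500 * education + 1_200 * experience
          + rng.normal(0, 5_000, size=n))   # y with noise

X = sm.add_constant(np.column_stack([education, experience]))
fit = sm.OLS(salary, X).fit()
print(fit.summary())   # coefficients, standard errors, R-squared, and more
```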

Logistic Regression

  • Used when the dependent variable is binary or categorical (pass/fail, yes/no)
  • Models the probability of an event occurring based on one or more independent variables
  • Employs a logistic function to transform the output to a probability between 0 and 1
  • Equation takes the form $\ln(\frac{p}{1-p}) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$, where $p$ is the probability of the event occurring, $x_1, x_2, ..., x_n$ are the independent variables, and $\beta_0, \beta_1, \beta_2, ..., \beta_n$ are the coefficients
  • Can be used for classification problems (predicting whether a customer will churn based on their demographics and behavior)
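
A hedged sketch of a logistic fit, again with statsmodels; the binary outcome here (program enrollment driven by income and age) is simulated purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
income = rng.normal(50, 15, size=n)       # household income in $1,000s
age = rng.uniform(18, 80, size=n)
log_odds = -4 + 0.05 * income + 0.03 * age
p = 1 / (1 + np.exp(-log_odds))           # logistic function maps to (0, 1)
enrolled = rng.binomial(1, p)             # observed binary outcome

X = sm.add_constant(np.column_stack([income, age]))
fit = sm.Logit(enrolled, X).fit()
print(fit.params)           # coefficients on the log-odds scale
print(np.exp(fit.params))   # the same coefficients as odds ratios
```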

Variables and Coefficients

Dependent and Independent Variables

  • Dependent variable is the outcome or response variable that is being predicted or explained by the model (house price, test score)
  • Independent variables, also known as predictor or explanatory variables, are the factors used to predict or explain the dependent variable (square footage, hours studied)
  • Choice of dependent and independent variables depends on the research question and the hypothesized relationships between variables

Coefficients

  • Coefficients represent the change in the dependent variable associated with a one-unit change in the corresponding independent variable, holding all other variables constant
  • Interpretation depends on the scale and units of the variables involved
  • In linear regression, the coefficient of the independent variable is the slope of the line (a coefficient of 1.5 means that for every one-unit increase in the independent variable, the dependent variable increases by 1.5 units on average)
  • In logistic regression, coefficients are on the log-odds scale and are interpreted through odds ratios: the odds ratio is $e^{\beta}$, so a coefficient of 0.7 corresponds to an odds ratio of about 2.0, meaning a one-unit increase in the independent variable roughly doubles the odds of the event occurring (see the check below)
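
A one-line check of that conversion (0.7 is just the illustrative coefficient from above):

```python
import numpy as np

beta = 0.7
print(np.exp(beta))   # ~2.01: the odds roughly double per one-unit increase
```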

Model Evaluation Metrics

R-squared

  • R-squared, or the coefficient of determination, measures the proportion of variance in the dependent variable that is explained by the independent variables in the model
  • Ranges from 0 to 1, with higher values indicating a better fit
  • Calculated as the explained variance divided by the total variance, or equivalently $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the total sum of squares
  • Adjusted R-squared accounts for the number of independent variables in the model and penalizes the addition of irrelevant variables
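
To make the formulas concrete, a small sketch computing $R^2$ and adjusted $R^2$ by hand; the observed and predicted arrays and the predictor count $k$ are hypothetical:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])       # observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.2, 10.8])   # model predictions
k = 1                                          # one predictor in this toy example
n = len(y)

ss_res = np.sum((y - y_hat) ** 2)       # sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes extra predictors
print(r2, adj_r2)
```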

Residuals

  • Residuals are the differences between the observed values of the dependent variable and the predicted values from the regression model
  • Calculated as $e_i = y_i - \hat{y}_i$, where $e_i$ is the residual for observation $i$, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value
  • Used to assess the assumptions of the model (normality, homoscedasticity, linearity) and identify outliers or influential observations
  • Plotting residuals against predicted values or independent variables can reveal patterns or deviations from assumptions (a funnel-shaped plot may indicate heteroscedasticity)
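
A sketch of such a residual plot, using synthetic data deliberately generated so the noise grows with $x$ (which should produce the funnel shape described above):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=150)
y = 2 + 3 * x + rng.normal(0, 1 + 0.3 * x, size=150)  # noise grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid, alpha=0.5)   # e_i vs. y-hat_i
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A funnel shape suggests heteroscedasticity")
plt.show()
```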

Common Issues

Multicollinearity

  • Multicollinearity occurs when two or more independent variables in a multiple regression model are highly correlated with each other
  • Can lead to unstable and unreliable coefficient estimates, as it becomes difficult to separate the individual effects of the correlated variables
  • Detected through correlation matrices, variance inflation factors (VIFs), or condition indices
  • Addressed by removing one of the correlated variables, combining them into a single variable, or using regularization techniques like ridge regression or lasso regression
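
A sketch of VIF-based detection with statsmodels' `variance_inflation_factor`, on synthetic data where one predictor nearly duplicates another; a common rule of thumb flags VIFs above 5 or 10:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1: collinear
x3 = rng.normal(size=n)                   # independent of the others

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):            # skip the constant column
    print(f"x{i}: VIF = {variance_inflation_factor(X, i):.1f}")
```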

Heteroscedasticity

  • Heteroscedasticity refers to the situation where the variance of the residuals is not constant across the range of the independent variables
  • Violates the assumption of homoscedasticity in linear regression, which states that the variance of the residuals should be constant
  • Does not bias the coefficient estimates themselves, but it does bias the standard errors, leading to incorrect inferences about the significance of the coefficients
  • Detected through visual inspection of residual plots (residuals vs. fitted values) or formal tests like the Breusch-Pagan test or White's test
  • Addressed by using robust standard errors, weighted least squares, or transforming the variables to stabilize the variance (taking the logarithm of the dependent variable)
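
A sketch combining detection and correction, using statsmodels' Breusch-Pagan test and HC3 robust standard errors on synthetic heteroscedastic data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = 1 + 2 * x + rng.normal(0, 0.5 + 0.5 * x, size=200)   # non-constant variance

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")   # small p suggests heteroscedasticity

robust = sm.OLS(y, X).fit(cov_type="HC3")          # refit with robust standard errors
print(robust.bse)                                  # corrected standard errors
```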