6.5 Heckman selection model

Written by the Fiveable Content Team • Last updated September 2025

The Heckman selection model addresses sample selection bias in econometrics, ensuring consistent parameter estimates when data is non-randomly missing. It uses a two-equation system: a selection equation for sample inclusion probability and an outcome equation for the relationship between variables.

This model is crucial when the standard regression assumption of random sampling is violated. It corrects for selection bias in scenarios such as labor market participation, program evaluation, and healthcare utilization, where self-selection or non-response can skew results.

Overview of Heckman selection model

  • The Heckman selection model is a statistical approach used in econometrics to address sample selection bias and estimate consistent parameters in the presence of non-random missing data
  • It consists of a two-equation system: a selection equation that models the probability of an observation being selected into the sample, and an outcome equation that models the relationship between the dependent variable and explanatory variables for the selected observations (the system is written out in notation below)
  • The model allows for the estimation of the effect of explanatory variables on the outcome variable while accounting for the non-random selection process
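In standard notation (the symbols here are chosen for exposition, not taken from the original text), the two-equation system can be written as:

```latex
% Selection equation: s_i = 1 when observation i enters the sample
s_i^{*} = z_i'\gamma + u_i, \qquad s_i = \mathbf{1}[\, s_i^{*} > 0 \,]

% Outcome equation: y_i is observed only when s_i = 1
y_i = x_i'\beta + \varepsilon_i
```

The errors $(u_i, \varepsilon_i)$ are assumed jointly normal with correlation $\rho$. Selection bias arises whenever $\rho \neq 0$, because the unobservables that drive selection then also shift the observed outcomes.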

Motivation for selection models

Limitations of standard regression

  • Standard regression models assume that the sample is randomly selected and representative of the population of interest
  • However, in many real-world situations, the sample may be non-randomly selected due to self-selection, sample attrition, or other factors
  • Ignoring the selection process can lead to biased and inconsistent parameter estimates

Presence of selection bias

  • Selection bias occurs when the probability of an observation being included in the sample is related to the outcome variable of interest
  • This can happen when individuals self-select into a program or treatment based on unobserved factors that also influence the outcome
  • Selection bias can also arise due to non-response or missing data in surveys or experiments

Examples of selection bias

  • Labor market participation: individuals with higher earning potential may be more likely to participate in the labor market, leading to a non-random sample of observed wages
  • Program evaluation: individuals who choose to participate in a training program may have different unobserved characteristics compared to those who do not participate, affecting the estimated impact of the program
  • Healthcare utilization: individuals who seek healthcare may have different health status or preferences compared to those who do not, leading to biased estimates of the effect of healthcare on health outcomes

Two-step estimation procedure

Step 1: Selection equation

  • The selection equation is a binary choice model, in practice usually a probit, that estimates the probability of an observation being selected into the sample; the probit's normality assumption is what underpins the inverse Mills ratio correction in the second step
  • It includes explanatory variables that are thought to influence the selection process but may not necessarily affect the outcome variable directly
  • The selection equation is used to calculate the inverse Mills ratio, which captures the effect of the selection process on the outcome

Step 2: Outcome equation

  • The outcome equation is a linear regression model that estimates the relationship between the dependent variable and explanatory variables for the selected observations
  • It includes the inverse Mills ratio as an additional explanatory variable to control for the selection bias
  • The coefficients in the outcome equation represent the effect of the explanatory variables on the outcome, conditional on being selected into the sample; a code sketch of both steps follows this list
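The following is a minimal sketch of the two-step procedure on simulated data, assuming the statsmodels and SciPy packages are available; the variable names (educ, kids, work, logwage) and all coefficient values are illustrative assumptions, not figures from this guide.

```python
# Two-step Heckman estimator on simulated data (illustrative sketch).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5_000

# Correlated errors for the selection (u) and outcome (e) equations
rho = 0.5
u, e = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

educ = rng.normal(12, 2, n)      # appears in both equations
kids = rng.integers(0, 3, n)     # exclusion restriction: selection equation only

work = (0.5 + 0.1 * educ - 0.4 * kids + u > 0).astype(int)   # selection indicator
logwage = 1.0 + 0.08 * educ + e                              # outcome (true beta = 0.08)

# Step 1: probit for selection, then the inverse Mills ratio at the fitted index
Z = sm.add_constant(np.column_stack([educ, kids]))
probit_res = sm.Probit(work, Z).fit(disp=0)
index = Z @ probit_res.params
imr = norm.pdf(index) / norm.cdf(index)

# Step 2: OLS on the selected sample with the inverse Mills ratio added
sel = work == 1
X = sm.add_constant(np.column_stack([educ[sel], imr[sel]]))
ols_res = sm.OLS(logwage[sel], X).fit()
print(ols_res.params)   # constant, educ, and the IMR coefficient
```

Note that the plain OLS standard errors in the second step are not valid, because the inverse Mills ratio is itself an estimated regressor; dedicated Heckman routines adjust them.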

Inverse Mills ratio

  • The inverse Mills ratio is a term derived from the selection equation that captures the effect of the selection process on the outcome
  • It is calculated as the ratio of the standard normal probability density function to the standard normal cumulative distribution function, both evaluated at the fitted index from the selection equation (written out below)
  • Including the inverse Mills ratio in the outcome equation helps to correct for the selection bias and obtain consistent parameter estimates
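Using the probit coefficient estimates $\hat{\gamma}$ from step 1, the inverse Mills ratio for observation $i$ is

```latex
\hat{\lambda}_i \;=\; \lambda(z_i'\hat{\gamma}) \;=\; \frac{\phi(z_i'\hat{\gamma})}{\Phi(z_i'\hat{\gamma})}
```

where $\phi$ and $\Phi$ are the standard normal density and distribution functions.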

Interpretation of coefficients

  • The coefficients in the outcome equation can be interpreted as the marginal effect of the explanatory variables on the outcome, conditional on being selected into the sample
  • The coefficient on the inverse Mills ratio is the product of the correlation between the unobserved factors that influence selection and those that influence the outcome, and the standard deviation of the outcome error, so its sign matches the sign of that correlation (see the expression below)
  • A statistically significant coefficient on the inverse Mills ratio indicates the presence of selection bias and the need for the Heckman correction
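Formally, in the notation used above, the second-step regression estimates

```latex
E[\, y_i \mid x_i,\; s_i = 1 \,] \;=\; x_i'\beta \;+\; \rho\,\sigma_{\varepsilon}\,\lambda(z_i'\gamma)
```

where $\rho$ is the correlation between the two errors and $\sigma_{\varepsilon}$ is the standard deviation of the outcome error. The coefficient on $\hat{\lambda}_i$ therefore estimates $\rho\,\sigma_{\varepsilon}$, and a t-test of that coefficient being zero is, in effect, a test of $H_0\!: \rho = 0$ (no selection bias).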

Maximum likelihood estimation

Joint distribution of errors

  • The Heckman selection model can also be estimated using maximum likelihood estimation (MLE)
  • MLE assumes a joint distribution of the errors in the selection and outcome equations, typically a bivariate normal distribution
  • The joint distribution allows for the estimation of the correlation between the unobserved factors in the selection and outcome equations

Log-likelihood function

  • The log-likelihood function for the Heckman model is derived based on the joint distribution of the errors
  • It consists of two parts: the contribution of the selected observations to the likelihood and the contribution of the non-selected observations
  • The log-likelihood function is maximized with respect to the parameters in the selection and outcome equations, as well as the correlation between the errors; one standard form is written out below
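Under bivariate normality, one standard way to write the log-likelihood (in the notation above, with $\sigma_{\varepsilon}$ the standard deviation of the outcome error) is

```latex
\ln L(\beta, \gamma, \sigma_{\varepsilon}, \rho) \;=\;
\sum_{i:\, s_i = 0} \ln\!\bigl[\, 1 - \Phi(z_i'\gamma) \,\bigr]
\;+\;
\sum_{i:\, s_i = 1} \left[
  \ln \phi\!\left(\frac{y_i - x_i'\beta}{\sigma_{\varepsilon}}\right)
  - \ln \sigma_{\varepsilon}
  + \ln \Phi\!\left(\frac{z_i'\gamma + \rho\,(y_i - x_i'\beta)/\sigma_{\varepsilon}}{\sqrt{1-\rho^{2}}}\right)
\right]
```

The first sum is the contribution of the non-selected observations; the second is the contribution of the selected observations.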

MLE vs two-step approach

  • MLE is more efficient than the two-step approach when the assumptions of the model are satisfied, as it uses all available information in the estimation process
  • However, MLE is more computationally intensive and may be more sensitive to misspecification of the joint distribution of the errors
  • The two-step approach is easier to implement and may be more robust to misspecification, but it is less efficient than MLE

Identification in selection models

Exclusion restrictions

  • Identification in the Heckman model requires that there is at least one variable in the selection equation that is not included in the outcome equation (an exclusion restriction)
  • The exclusion restriction should be a variable that affects the probability of selection but does not directly influence the outcome variable
  • Examples of exclusion restrictions may include variables related to the selection process, such as the availability of the program or the distance to the program site

Nonlinearity as identification

  • In some cases, identification can be achieved through the nonlinearity of the selection equation, even without an exclusion restriction
  • The nonlinearity in the probit or logit model can provide sufficient variation to identify the parameters in the outcome equation
  • However, relying on nonlinearity for identification may lead to less precise estimates and may be more sensitive to functional form assumptions

Assumptions of Heckman model

Normality of errors

  • The Heckman model assumes that the errors in the selection and outcome equations follow a bivariate normal distribution
  • This assumption is necessary for the consistency of the parameter estimates and the validity of the statistical inference
  • Violations of the normality assumption can lead to biased estimates and incorrect standard errors

Homoskedasticity

  • The model assumes that the errors in the outcome equation have constant variance (homoskedasticity)
  • If the errors are heteroskedastic, the standard errors of the coefficients may be incorrect, leading to invalid inference
  • Heteroskedasticity-robust standard errors can be used to address this issue in the outcome equation; heteroskedasticity in the selection equation is more serious, since it invalidates the probit estimates and hence the inverse Mills ratio

Independence of errors

  • The Heckman model assumes that the errors in the selection and outcome equations are independent of the explanatory variables
  • This assumption is necessary for the consistency of the parameter estimates
  • If the errors are correlated with the explanatory variables (endogeneity), the estimates may be biased, and alternative methods, such as instrumental variables, may be needed

Marginal effects in selection models

Conditional marginal effects

  • Conditional marginal effects measure the effect of a change in an explanatory variable on the outcome, conditional on being selected into the sample
  • They are obtained by differentiating the conditional expectation of the outcome with respect to the explanatory variable of interest; when the variable does not enter the selection equation this is simply its outcome-equation coefficient, but when it appears in both equations the change in the inverse Mills ratio must be included as well (see the formulas after this list)
  • Conditional marginal effects provide insight into the relationship between the explanatory variables and the outcome for the selected observations
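In the notation above, for a variable $x_k$ that appears in both equations (with outcome coefficient $\beta_k$ and selection coefficient $\gamma_k$), the conditional marginal effect is

```latex
\frac{\partial\, E[\, y_i \mid x_i,\; s_i = 1 \,]}{\partial x_{k}}
\;=\;
\beta_{k} \;-\; \rho\,\sigma_{\varepsilon}\,\gamma_{k}\,
\lambda(z_i'\gamma)\bigl[\, z_i'\gamma + \lambda(z_i'\gamma) \,\bigr]
```

using the fact that $d\lambda(c)/dc = -\lambda(c)[c + \lambda(c)]$; if $x_k$ does not enter the selection equation, the expression reduces to $\beta_k$.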

Unconditional marginal effects

  • Unconditional marginal effects measure the effect of a change in an explanatory variable on the outcome, taking into account both the direct effect on the outcome and the indirect effect through the selection process
  • They can be calculated by combining the marginal effects from the selection and outcome equations, weighted by the probability of selection (one common formulation is shown below)
  • Unconditional marginal effects provide a more comprehensive measure of the impact of the explanatory variables on the outcome for the entire population
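One common formulation works with the expected observed outcome, which treats non-selected observations as contributing through the selection probability. In the notation above,

```latex
E[\, s_i\, y_i \mid x_i, z_i \,] \;=\; \Phi(z_i'\gamma)\, x_i'\beta \;+\; \rho\,\sigma_{\varepsilon}\,\phi(z_i'\gamma),
\qquad
\frac{\partial\, E[\, s_i\, y_i \,]}{\partial x_{k}}
\;=\;
\Phi(z_i'\gamma)\,\beta_{k}
\;+\;
\gamma_{k}\,\phi(z_i'\gamma)\bigl[\, x_i'\beta - \rho\,\sigma_{\varepsilon}\, z_i'\gamma \,\bigr]
```

The first term is the effect on the outcome for those selected, weighted by the selection probability; the second is the effect that works through changing the probability of selection.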

Strengths of Heckman approach

Correcting for selection bias

  • The Heckman selection model addresses the issue of selection bias by explicitly modeling the selection process and including a correction term (inverse Mills ratio) in the outcome equation
  • By accounting for the non-random selection, the Heckman approach helps to obtain consistent estimates of the parameters in the presence of selection bias
  • This is particularly useful in situations where the sample is not representative of the population of interest due to self-selection or non-response

Consistent parameter estimates

  • When the assumptions of the Heckman model are satisfied, the parameter estimates obtained from the two-step or maximum likelihood estimation are consistent
  • Consistency means that the estimates converge to the true population parameters as the sample size increases
  • Consistent estimates are important for making reliable inferences and policy recommendations based on the results of the analysis

Limitations of Heckman model

Sensitivity to distributional assumptions

  • The Heckman model relies on the assumption of bivariate normality of the errors in the selection and outcome equations
  • Violations of this assumption can lead to biased and inconsistent estimates
  • The model may be sensitive to misspecification of the joint distribution, and alternative distributional assumptions (e.g., multivariate t-distribution) may be considered

Difficulty finding exclusion restrictions

  • Identification in the Heckman model often relies on the availability of valid exclusion restrictions (variables that affect selection but not the outcome)
  • Finding suitable exclusion restrictions can be challenging in practice, as it requires a deep understanding of the selection process and the factors that influence it
  • Weak or invalid exclusion restrictions can lead to imprecise estimates and sensitivity to model specification

Applications of selection models

Labor market participation

  • Heckman selection models are widely used in labor economics to study wage determination and labor market outcomes
  • The selection equation models the decision to participate in the labor market, while the outcome equation models the wage earned by those who participate
  • The model helps to correct for the selection bias that arises from the fact that wages are only observed for individuals who choose to work

Program evaluation

  • Selection models are also used in program evaluation to estimate the impact of a treatment or intervention on an outcome of interest
  • The selection equation models the decision to participate in the program, while the outcome equation models the effect of the program on the outcome for those who participate
  • The Heckman approach helps to address the issue of self-selection into the program, which can lead to biased estimates of the program's impact

Healthcare utilization

  • In health economics, selection models are used to study the determinants of healthcare utilization and the effect of healthcare on health outcomes
  • The selection equation models the decision to seek healthcare, while the outcome equation models the relationship between healthcare utilization and health outcomes
  • The Heckman approach helps to account for the non-random selection of individuals into healthcare, which can be influenced by factors such as health status and preferences

Alternatives to Heckman model

Instrumental variables approach

  • Instrumental variables (IV) can be used as an alternative to the Heckman model when there are concerns about endogeneity in the explanatory variables
  • The IV approach relies on finding a variable (instrument) that is correlated with the endogenous explanatory variable but uncorrelated with the error term in the outcome equation
  • The IV estimator provides consistent estimates of the parameters, but it may be less efficient than the Heckman approach if the assumptions of the Heckman model are satisfied; a simple two-stage sketch follows this list
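The following is a minimal two-stage least squares sketch on simulated data, assuming statsmodels is available; the data-generating process and coefficient values are illustrative assumptions.

```python
# Two-stage least squares (2SLS) with one endogenous regressor and one
# instrument, on simulated data (illustrative sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5_000

z = rng.normal(size=n)                            # instrument
u = rng.normal(size=n)                            # structural error
x_endog = 0.8 * z + 0.6 * u + rng.normal(size=n)  # regressor correlated with u
y = 1.0 + 2.0 * x_endog + u                       # true coefficient on x_endog is 2.0

# Stage 1: regress the endogenous regressor on the instrument, keep fitted values
stage1 = sm.OLS(x_endog, sm.add_constant(z)).fit()

# Stage 2: regress the outcome on the stage-1 fitted values
stage2 = sm.OLS(y, sm.add_constant(stage1.fittedvalues)).fit()
print(stage2.params)   # second element is the 2SLS estimate of the true 2.0
```

The standard errors reported by the second-stage OLS are not the correct 2SLS standard errors; dedicated IV routines compute them properly.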

Propensity score matching

  • Propensity score matching (PSM) is a matching-based method for addressing selection bias in observational studies; unlike the Heckman model, it corrects only for selection on observed characteristics
  • PSM involves estimating the probability of selection (the propensity score) from observed characteristics and then matching treated and untreated observations with similar propensity scores
  • PSM can be used as a preprocessing step to create a balanced sample before applying standard regression techniques (a short matching sketch follows this list)
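Below is a minimal sketch of one-to-one nearest-neighbor matching on an estimated propensity score, using simulated data and statsmodels; the covariates, treatment rule, and effect size are illustrative assumptions.

```python
# Nearest-neighbor propensity score matching on simulated data (illustrative sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=(n, 2))                                    # observed covariates
treat = (x @ np.array([0.8, -0.5]) + rng.normal(size=n) > 0).astype(int)
y = x @ np.array([1.0, 2.0]) + 1.5 * treat + rng.normal(size=n)  # true effect = 1.5

# 1. Estimate the propensity score with a logit on observed covariates
logit_res = sm.Logit(treat, sm.add_constant(x)).fit(disp=0)
pscore = logit_res.predict(sm.add_constant(x))

# 2. For each treated unit, find the control with the closest propensity score
treated = np.flatnonzero(treat == 1)
controls = np.flatnonzero(treat == 0)
dist = np.abs(pscore[treated][:, None] - pscore[controls][None, :])
matches = controls[dist.argmin(axis=1)]

# 3. Average treatment effect on the treated: mean outcome gap across matched pairs
att = (y[treated] - y[matches]).mean()
print(att)
```

In practice, matching is usually combined with a caliper and balance checks on the matched sample; this sketch omits those refinements.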

Control function approach

  • The control function approach is similar to the Heckman model in that it involves estimating a selection equation and including a correction term in the outcome equation
  • However, the control function approach is more flexible in terms of the distributional assumptions and can accommodate non-normality and heteroskedasticity in the errors
  • The control function approach is particularly useful when there are concerns about the validity of the exclusion restrictions or the functional form of the selection equation