6.5 Heckman selection model

Written by the Fiveable Content Team • Last updated September 2025

The Heckman selection model addresses sample selection bias in econometrics, ensuring consistent parameter estimates when data is non-randomly missing. It uses a two-equation system: a selection equation for sample inclusion probability and an outcome equation for the relationship between variables.

This model is crucial when the standard regression assumption of random sampling is violated. It corrects for selection bias in scenarios such as labor market participation, program evaluation, and healthcare utilization, where self-selection or non-response can skew results.

Overview of Heckman selection model

  • The Heckman selection model is a statistical approach used in econometrics to address sample selection bias and estimate consistent parameters in the presence of non-random missing data
  • It consists of a two-equation system: a selection equation that models the probability of an observation being selected into the sample, and an outcome equation that models the relationship between the dependent variable and explanatory variables for the selected observations (the system is written out in notation below)
  • The model allows for the estimation of the effect of explanatory variables on the outcome variable while accounting for the non-random selection process
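In standard notation (the symbols here are chosen for exposition, not taken from the original text), the two-equation system can be written as:

```latex
% Selection equation: s_i = 1 when observation i enters the sample
s_i^{*} = z_i'\gamma + u_i, \qquad s_i = \mathbf{1}[\, s_i^{*} > 0 \,]

% Outcome equation: y_i is observed only when s_i = 1
y_i = x_i'\beta + \varepsilon_i
```

The errors $(u_i, \varepsilon_i)$ are assumed jointly normal with correlation $\rho$. Selection bias arises whenever $\rho \neq 0$, because the unobservables that drive selection then also shift the observed outcomes.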

Motivation for selection models

Limitations of standard regression

  • Standard regression models assume that the sample is randomly selected and representative of the population of interest
  • However, in many real-world situations, the sample may be non-randomly selected due to self-selection, sample attrition, or other factors
  • Ignoring the selection process can lead to biased and inconsistent parameter estimates

Presence of selection bias

  • Selection bias occurs when the probability of an observation being included in the sample is related to the outcome variable of interest
  • This can happen when individuals self-select into a program or treatment based on unobserved factors that also influence the outcome
  • Selection bias can also arise due to non-response or missing data in surveys or experiments

Examples of selection bias

  • Labor market participation: individuals with higher earning potential may be more likely to participate in the labor market, leading to a non-random sample of observed wages
  • Program evaluation: individuals who choose to participate in a training program may have different unobserved characteristics compared to those who do not participate, affecting the estimated impact of the program
  • Healthcare utilization: individuals who seek healthcare may have different health status or preferences compared to those who do not, leading to biased estimates of the effect of healthcare on health outcomes

Two-step estimation procedure

Step 1: Selection equation

  • The selection equation is a binary choice model, in practice usually a probit, that estimates the probability of an observation being selected into the sample; the probit's normality assumption is what underpins the inverse Mills ratio correction in the second step
  • It includes explanatory variables that are thought to influence the selection process but may not necessarily affect the outcome variable directly
  • The selection equation is used to calculate the inverse Mills ratio, which captures the effect of the selection process on the outcome

Step 2: Outcome equation

  • The outcome equation is a linear regression model that estimates the relationship between the dependent variable and explanatory variables for the selected observations
  • It includes the inverse Mills ratio as an additional explanatory variable to control for the selection bias
  • The coefficients in the outcome equation represent the effect of the explanatory variables on the outcome, conditional on being selected into the sample; a code sketch of both steps follows this list
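The following is a minimal sketch of the two-step procedure on simulated data, assuming the statsmodels and SciPy packages are available; the variable names (educ, kids, work, logwage) and all coefficient values are illustrative assumptions, not figures from this guide.

```python
# Two-step Heckman estimator on simulated data (illustrative sketch).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5_000

# Correlated errors for the selection (u) and outcome (e) equations
rho = 0.5
u, e = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

educ = rng.normal(12, 2, n)      # appears in both equations
kids = rng.integers(0, 3, n)     # exclusion restriction: selection equation only

work = (0.5 + 0.1 * educ - 0.4 * kids + u > 0).astype(int)   # selection indicator
logwage = 1.0 + 0.08 * educ + e                              # outcome (true beta = 0.08)

# Step 1: probit for selection, then the inverse Mills ratio at the fitted index
Z = sm.add_constant(np.column_stack([educ, kids]))
probit_res = sm.Probit(work, Z).fit(disp=0)
index = Z @ probit_res.params
imr = norm.pdf(index) / norm.cdf(index)

# Step 2: OLS on the selected sample with the inverse Mills ratio added
sel = work == 1
X = sm.add_constant(np.column_stack([educ[sel], imr[sel]]))
ols_res = sm.OLS(logwage[sel], X).fit()
print(ols_res.params)   # constant, educ, and the IMR coefficient
```

Note that the plain OLS standard errors in the second step are not valid, because the inverse Mills ratio is itself an estimated regressor; dedicated Heckman routines adjust them.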

Inverse Mills ratio

  • The inverse Mills ratio is a term derived from the selection equation that captures the effect of the selection process on the outcome
  • It is calculated as the ratio of the standard normal probability density function to the standard normal cumulative distribution function, both evaluated at the fitted index from the selection equation (written out below)
  • Including the inverse Mills ratio in the outcome equation helps to correct for the selection bias and obtain consistent parameter estimates
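Using the probit coefficient estimates $\hat{\gamma}$ from step 1, the inverse Mills ratio for observation $i$ is

```latex
\hat{\lambda}_i \;=\; \lambda(z_i'\hat{\gamma}) \;=\; \frac{\phi(z_i'\hat{\gamma})}{\Phi(z_i'\hat{\gamma})}
```

where $\phi$ and $\Phi$ are the standard normal density and distribution functions.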

Interpretation of coefficients

  • The coefficients in the outcome equation can be interpreted as the marginal effect of the explanatory variables on the outcome, conditional on being selected into the sample
  • The coefficient on the inverse Mills ratio is the product of the correlation between the unobserved factors that influence selection and those that influence the outcome, and the standard deviation of the outcome error, so its sign matches the sign of that correlation (see the expression below)
  • A statistically significant coefficient on the inverse Mills ratio indicates the presence of selection bias and the need for the Heckman correction
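Formally, in the notation used above, the second-step regression estimates

```latex
E[\, y_i \mid x_i,\; s_i = 1 \,] \;=\; x_i'\beta \;+\; \rho\,\sigma_{\varepsilon}\,\lambda(z_i'\gamma)
```

where $\rho$ is the correlation between the two errors and $\sigma_{\varepsilon}$ is the standard deviation of the outcome error. The coefficient on $\hat{\lambda}_i$ therefore estimates $\rho\,\sigma_{\varepsilon}$, and a t-test of that coefficient being zero is, in effect, a test of $H_0\!: \rho = 0$ (no selection bias).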

Maximum likelihood estimation

Joint distribution of errors

  • The Heckman selection model can also be estimated using maximum likelihood estimation (MLE)
  • MLE assumes a joint distribution of the errors in the selection and outcome equations, typically a bivariate normal distribution
  • The joint distribution allows for the estimation of the correlation between the unobserved factors in the selection and outcome equations

Log-likelihood function

  • The log-likelihood function for the Heckman model is derived based on the joint distribution of the errors
  • It consists of two parts: the contribution of the selected observations to the likelihood and the contribution of the non-selected observations
  • The log-likelihood function is maximized with respect to the parameters in the selection and outcome equations, as well as the correlation between the errors; one standard form is written out below
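Under bivariate normality, one standard way to write the log-likelihood (in the notation above, with $\sigma_{\varepsilon}$ the standard deviation of the outcome error) is

```latex
\ln L(\beta, \gamma, \sigma_{\varepsilon}, \rho) \;=\;
\sum_{i:\, s_i = 0} \ln\!\bigl[\, 1 - \Phi(z_i'\gamma) \,\bigr]
\;+\;
\sum_{i:\, s_i = 1} \left[
  \ln \phi\!\left(\frac{y_i - x_i'\beta}{\sigma_{\varepsilon}}\right)
  - \ln \sigma_{\varepsilon}
  + \ln \Phi\!\left(\frac{z_i'\gamma + \rho\,(y_i - x_i'\beta)/\sigma_{\varepsilon}}{\sqrt{1-\rho^{2}}}\right)
\right]
```

The first sum is the contribution of the non-selected observations; the second is the contribution of the selected observations.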

MLE vs two-step approach

  • MLE is more efficient than the two-step approach when the assumptions of the model are satisfied, as it uses all available information in the estimation process
  • However, MLE is more computationally intensive and may be more sensitive to misspecification of the joint distribution of the errors
  • The two-step approach is easier to implement and may be more robust to misspecification, but it is less efficient than MLE

Identification in selection models

Exclusion restrictions

  • Identification in the Heckman model requires that there is at least one variable in the selection equation that is not included in the outcome equation (an exclusion restriction)
  • The exclusion restriction should be a variable that affects the probability of selection but does not directly influence the outcome variable
  • Examples of exclusion restrictions may include variables related to the selection process, such as the availability of the program or the distance to the program site

Nonlinearity as identification

  • In some cases, identification can be achieved through the nonlinearity of the selection equation, even without an exclusion restriction
  • The nonlinearity in the probit or logit model can provide sufficient variation to identify the parameters in the outcome equation
  • However, relying on nonlinearity for identification may lead to less precise estimates and may be more sensitive to functional form assumptions

Assumptions of Heckman model

Normality of errors

  • The Heckman model assumes that the errors in the selection and outcome equations follow a bivariate normal distribution
  • This assumption is necessary for the consistency of the parameter estimates and the validity of the statistical inference
  • Violations of the normality assumption can lead to biased estimates and incorrect standard errors

Homoskedasticity

  • The model assumes that the errors in the outcome equation have constant variance (homoskedasticity)
  • If the errors are heteroskedastic, the standard errors of the coefficients may be incorrect, leading to invalid inference
  • Heteroskedasticity-robust standard errors can be used to address this issue in the outcome equation; heteroskedasticity in the selection equation is more serious, since it invalidates the probit estimates and hence the inverse Mills ratio

Independence of errors

  • The Heckman model assumes that the errors in the selection and outcome equations are independent of the explanatory variables
  • This assumption is necessary for the consistency of the parameter estimates
  • If the errors are correlated with the explanatory variables (endogeneity), the estimates may be biased, and alternative methods, such as instrumental variables, may be needed

Marginal effects in selection models

Conditional marginal effects

  • Conditional marginal effects measure the effect of a change in an explanatory variable on the outcome, conditional on being selected into the sample
  • They are obtained by differentiating the conditional expectation of the outcome with respect to the explanatory variable of interest; when the variable does not enter the selection equation this is simply its outcome-equation coefficient, but when it appears in both equations the change in the inverse Mills ratio must be included as well (see the formulas after this list)
  • Conditional marginal effects provide insight into the relationship between the explanatory variables and the outcome for the selected observations
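In the notation above, for a variable $x_k$ that appears in both equations (with outcome coefficient $\beta_k$ and selection coefficient $\gamma_k$), the conditional marginal effect is

```latex
\frac{\partial\, E[\, y_i \mid x_i,\; s_i = 1 \,]}{\partial x_{k}}
\;=\;
\beta_{k} \;-\; \rho\,\sigma_{\varepsilon}\,\gamma_{k}\,
\lambda(z_i'\gamma)\bigl[\, z_i'\gamma + \lambda(z_i'\gamma) \,\bigr]
```

using the fact that $d\lambda(c)/dc = -\lambda(c)[c + \lambda(c)]$; if $x_k$ does not enter the selection equation, the expression reduces to $\beta_k$.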

Unconditional marginal effects

  • Unconditional marginal effects measure the effect of a change in an explanatory variable on the outcome, taking into account both the direct effect on the outcome and the indirect effect through the selection process
  • They can be calculated by combining the marginal effects from the selection and outcome equations, weighted by the probability of selection (one common formulation is shown below)
  • Unconditional marginal effects provide a more comprehensive measure of the impact of the explanatory variables on the outcome for the entire population
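One common formulation works with the expected observed outcome, which treats non-selected observations as contributing through the selection probability. In the notation above,

```latex
E[\, s_i\, y_i \mid x_i, z_i \,] \;=\; \Phi(z_i'\gamma)\, x_i'\beta \;+\; \rho\,\sigma_{\varepsilon}\,\phi(z_i'\gamma),
\qquad
\frac{\partial\, E[\, s_i\, y_i \,]}{\partial x_{k}}
\;=\;
\Phi(z_i'\gamma)\,\beta_{k}
\;+\;
\gamma_{k}\,\phi(z_i'\gamma)\bigl[\, x_i'\beta - \rho\,\sigma_{\varepsilon}\, z_i'\gamma \,\bigr]
```

The first term is the effect on the outcome for those selected, weighted by the selection probability; the second is the effect that works through changing the probability of selection.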

Strengths of Heckman approach

Correcting for selection bias

  • The Heckman selection model addresses the issue of selection bias by explicitly modeling the selection process and including a correction term (inverse Mills ratio) in the outcome equation
  • By accounting for the non-random selection, the Heckman approach helps to obtain consistent estimates of the parameters in the presence of selection bias
  • This is particularly useful in situations where the sample is not representative of the population of interest due to self-selection or non-response

Consistent parameter estimates

  • When the assumptions of the Heckman model are satisfied, the parameter estimates obtained from the two-step or maximum likelihood estimation are consistent
  • Consistency means that the estimates converge to the true population parameters as the sample size increases
  • Consistent estimates are important for making reliable inferences and policy recommendations based on the results of the analysis

Limitations of Heckman model

Sensitivity to distributional assumptions

  • The Heckman model relies on the assumption of bivariate normality of the errors in the selection and outcome equations
  • Violations of this assumption can lead to biased and inconsistent estimates
  • The model may be sensitive to misspecification of the joint distribution, and alternative distributional assumptions (e.g., multivariate t-distribution) may be considered

Difficulty finding exclusion restrictions

  • Identification in the Heckman model often relies on the availability of valid exclusion restrictions (variables that affect selection but not the outcome)
  • Finding suitable exclusion restrictions can be challenging in practice, as it requires a deep understanding of the selection process and the factors that influence it
  • Weak or invalid exclusion restrictions can lead to imprecise estimates and sensitivity to model specification

Applications of selection models

Labor market participation

  • Heckman selection models are widely used in labor economics to study wage determination and labor market outcomes
  • The selection equation models the decision to participate in the labor market, while the outcome equation models the wage earned by those who participate
  • The model helps to correct for the selection bias that arises from the fact that wages are only observed for individuals who choose to work

Program evaluation

  • Selection models are also used in program evaluation to estimate the impact of a treatment or intervention on an outcome of interest
  • The selection equation models the decision to participate in the program, while the outcome equation models the effect of the program on the outcome for those who participate
  • The Heckman approach helps to address the issue of self-selection into the program, which can lead to biased estimates of the program's impact

Healthcare utilization

  • In health economics, selection models are used to study the determinants of healthcare utilization and the effect of healthcare on health outcomes
  • The selection equation models the decision to seek healthcare, while the outcome equation models the relationship between healthcare utilization and health outcomes
  • The Heckman approach helps to account for the non-random selection of individuals into healthcare, which can be influenced by factors such as health status and preferences

Alternatives to Heckman model

Instrumental variables approach

  • Instrumental variables (IV) can be used as an alternative to the Heckman model when there are concerns about endogeneity in the explanatory variables
  • The IV approach relies on finding a variable (instrument) that is correlated with the endogenous explanatory variable but uncorrelated with the error term in the outcome equation
  • The IV estimator provides consistent estimates of the parameters, but it may be less efficient than the Heckman approach if the assumptions of the Heckman model are satisfied; a simple two-stage sketch follows this list
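The following is a minimal two-stage least squares sketch on simulated data, assuming statsmodels is available; the data-generating process and coefficient values are illustrative assumptions.

```python
# Two-stage least squares (2SLS) with one endogenous regressor and one
# instrument, on simulated data (illustrative sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5_000

z = rng.normal(size=n)                            # instrument
u = rng.normal(size=n)                            # structural error
x_endog = 0.8 * z + 0.6 * u + rng.normal(size=n)  # regressor correlated with u
y = 1.0 + 2.0 * x_endog + u                       # true coefficient on x_endog is 2.0

# Stage 1: regress the endogenous regressor on the instrument, keep fitted values
stage1 = sm.OLS(x_endog, sm.add_constant(z)).fit()

# Stage 2: regress the outcome on the stage-1 fitted values
stage2 = sm.OLS(y, sm.add_constant(stage1.fittedvalues)).fit()
print(stage2.params)   # second element is the 2SLS estimate of the true 2.0
```

The standard errors reported by the second-stage OLS are not the correct 2SLS standard errors; dedicated IV routines compute them properly.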

Propensity score matching

  • Propensity score matching (PSM) is a matching-based method for addressing selection bias in observational studies; unlike the Heckman model, it corrects only for selection on observed characteristics
  • PSM involves estimating the probability of selection (the propensity score) from observed characteristics and then matching treated and untreated observations with similar propensity scores
  • PSM can be used as a preprocessing step to create a balanced sample before applying standard regression techniques (a short matching sketch follows this list)
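Below is a minimal sketch of one-to-one nearest-neighbor matching on an estimated propensity score, using simulated data and statsmodels; the covariates, treatment rule, and effect size are illustrative assumptions.

```python
# Nearest-neighbor propensity score matching on simulated data (illustrative sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=(n, 2))                                    # observed covariates
treat = (x @ np.array([0.8, -0.5]) + rng.normal(size=n) > 0).astype(int)
y = x @ np.array([1.0, 2.0]) + 1.5 * treat + rng.normal(size=n)  # true effect = 1.5

# 1. Estimate the propensity score with a logit on observed covariates
logit_res = sm.Logit(treat, sm.add_constant(x)).fit(disp=0)
pscore = logit_res.predict(sm.add_constant(x))

# 2. For each treated unit, find the control with the closest propensity score
treated = np.flatnonzero(treat == 1)
controls = np.flatnonzero(treat == 0)
dist = np.abs(pscore[treated][:, None] - pscore[controls][None, :])
matches = controls[dist.argmin(axis=1)]

# 3. Average treatment effect on the treated: mean outcome gap across matched pairs
att = (y[treated] - y[matches]).mean()
print(att)
```

In practice, matching is usually combined with a caliper and balance checks on the matched sample; this sketch omits those refinements.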

Control function approach

  • The control function approach is similar to the Heckman model in that it involves estimating a selection equation and including a correction term in the outcome equation
  • However, the control function approach is more flexible in terms of the distributional assumptions and can accommodate non-normality and heteroskedasticity in the errors
  • The control function approach is particularly useful when there are concerns about the validity of the exclusion restrictions or the functional form of the selection equation