Fiveable

Intro to Econometrics Unit 6 Review

6.4 Sample selection bias

Written by the Fiveable Content Team • Last updated September 2025

Sample selection bias is a critical issue in econometrics that can lead to skewed results and faulty conclusions. It occurs when the sample used in a study isn't representative of the population, resulting in biased estimates and reduced external validity.

Detecting and correcting for sample selection bias is crucial for accurate analysis. Methods like comparing sample characteristics to the population, using the Heckman selection model, and applying techniques such as inverse probability weighting can help address this issue.

Types of sample selection bias

  • Sample selection bias occurs when the sample used in a study is not representative of the population of interest, leading to biased and inconsistent estimates
  • Arises when individuals are selected, or select themselves, into or out of the sample non-randomly, so that the unobserved factors driving selection are correlated with the unobserved determinants of the dependent variable
  • Common types include:
    • Non-response bias: individuals who refuse to participate in a survey may differ systematically from those who do participate
    • Incidental truncation: the dependent variable is observed only for a subset determined by another outcome, such as wages being observed only for employed individuals
    • Self-selection bias: individuals select themselves into treatment or control groups based on unobserved characteristics

Consequences of sample selection bias

Biased parameter estimates

  • Sample selection bias leads to biased and inconsistent estimates of the parameters of interest in a regression model
  • The estimated coefficients will be biased because the sample used for estimation is not representative of the true population
  • The direction and magnitude of the bias depend on the nature of the selection process and the correlation between the unobserved factors affecting selection and the dependent variable
    • For example, if high-ability individuals are more likely to self-select into a training program and also have higher earnings, the estimated effect of the training program on earnings will be upward biased
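The training example can be illustrated with a small simulation. This is a minimal sketch under assumed numbers: the true effect of 2.0, the coefficient of 3 on ability, and the participation rule are all hypothetical choices made for illustration.

```python
import random
from statistics import mean

def simulate_training(n=20000, true_effect=2.0, seed=0):
    """Naive treated-minus-control earnings gap under self-selection on ability."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)  # unobserved by the researcher
        # Higher ability makes participation more likely (self-selection).
        participates = ability + rng.gauss(0, 1) > 0
        earnings = 10 + 3 * ability + true_effect * participates + rng.gauss(0, 1)
        (treated if participates else control).append(earnings)
    return mean(treated) - mean(control)

naive = simulate_training()
print(f"true effect = 2.0, naive difference in means = {naive:.2f}")
```

Because participants have higher average ability than non-participants, the naive difference in means picks up the ability gap in addition to the training effect, so it overstates the true effect.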

Incorrect inferences

  • Biased parameter estimates due to sample selection can lead to incorrect inferences and conclusions about the relationship between variables
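  • Hypothesis tests and confidence intervals based on biased estimates are invalid: tests no longer have their stated significance level and intervals no longer have their stated coverage, which raises the risk of Type I (false positive) or Type II (false negative) errors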
  • The presence of sample selection bias can make it difficult to establish causal relationships between variables, as the observed association may be driven by unobserved factors rather than a true causal effect

Reduced external validity

  • Sample selection bias can limit the external validity or generalizability of the study's findings to the broader population of interest
  • If the sample is not representative of the population, the estimated relationships and effects may not hold for individuals outside the sample
  • This can be particularly problematic when the goal is to make policy recommendations or draw conclusions that are applicable to a wider population
    • For instance, if a study on the effectiveness of a job training program only includes individuals who chose to participate, the results may not generalize to the population of all eligible individuals

Detecting sample selection bias

Comparing sample vs population

  • One way to detect sample selection bias is to compare the characteristics of the sample used in the study with those of the target population
  • Significant differences between the sample and population in terms of observable characteristics (such as demographics or socioeconomic status) may indicate the presence of selection bias
  • Statistical tests, such as t-tests or chi-square tests, can be used to assess whether the differences between the sample and population are statistically significant
    • For example, if a survey on income has a significantly higher proportion of high-income respondents compared to the population, this may suggest non-response bias
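A comparison of this kind might look like the sketch below, which tests whether a sample mean differs from a known population mean. The respondent ages and the census mean of 45 are made-up numbers, and the function uses a large-sample normal approximation in place of an exact t-test.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_vs_population(sample, population_mean):
    """Large-sample z-test of H0: the sample mean equals the population mean."""
    n = len(sample)
    z = (mean(sample) - population_mean) / (stdev(sample) / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p_value

# Hypothetical data: survey respondents' ages versus a census mean of 45 years.
respondent_ages = [52, 61, 48, 57, 66, 49, 58, 63, 55, 60, 47, 59, 62, 54, 51,
                   64, 56, 53, 50, 65, 58, 57, 61, 49, 60, 55, 62, 59, 52, 56]
z, p = mean_vs_population(respondent_ages, population_mean=45)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value says the sample differs from the population on this observable characteristic. It does not by itself show that the estimates of interest are biased, and a large p-value does not rule out selection on unobserved factors.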

Using Heckman selection model

  • The Heckman selection model is a statistical method designed to detect and correct for sample selection bias in regression analysis
  • It involves estimating two equations: a selection equation that models the probability of an individual being included in the sample, and an outcome equation that models the relationship between the dependent variable and independent variables for the selected sample
  • The Heckman model includes an additional term, the inverse Mills ratio, in the outcome equation to account for the potential correlation between the unobserved factors affecting selection and the dependent variable
    • A statistically significant coefficient on the inverse Mills ratio indicates the presence of sample selection bias
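In equation form, the model can be sketched as follows. The notation is assumed here for illustration: z_i are the selection regressors, x_i the outcome regressors, and the errors (u_i, ε_i) are taken to be bivariate normal with Var(u_i) = 1, Var(ε_i) = σ_ε², and correlation ρ.

```latex
\begin{aligned}
s_i &= \mathbf{1}\left[ z_i'\gamma + u_i > 0 \right]
  && \text{(selection equation)} \\
y_i &= x_i'\beta + \varepsilon_i, \quad \text{observed only if } s_i = 1
  && \text{(outcome equation)} \\
\mathbb{E}\left[ y_i \mid x_i, z_i, s_i = 1 \right]
  &= x_i'\beta + \rho\,\sigma_\varepsilon\,\lambda(z_i'\gamma),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{aligned}
```

Here λ is the inverse Mills ratio, with φ and Φ the standard normal density and CDF. If ρ = 0 the extra term vanishes and OLS on the selected sample is unbiased, which is why the coefficient on the inverse Mills ratio serves as a test for selection bias.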

Correcting for sample selection bias

Heckman two-step procedure

  • The Heckman two-step procedure is a method for correcting sample selection bias in regression analysis
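  • In the first step, a probit model is estimated on the full sample to predict the probability of an individual being included in the selected sample based on observed characteristics (the selection equation); in practice the selection equation should contain at least one variable that affects selection but not the outcome
  • In the second step, the inverse Mills ratio is computed for each selected observation from the fitted probit index (the standard normal density divided by the standard normal CDF, both evaluated at the index) and included as an additional regressor in the outcome equation, which is then estimated by ordinary least squares (OLS) on the selected sample
    • The coefficient on the inverse Mills ratio captures the effect of sample selection, and the remaining coefficients are consistent estimates of the parameters of interest provided the model's assumptions (notably joint normality of the errors) hold; the usual OLS standard errors must be adjusted because the inverse Mills ratio is itself estimated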
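The two steps can be sketched on simulated data using only the Python standard library. Everything here is an illustrative assumption: the data-generating process, the variable z used as the excluded selection variable, and the helper names `solve`, `least_squares`, `probit`, and `heckman_two_step`. The sketch returns point estimates only and does not compute corrected standard errors.

```python
import random
from statistics import NormalDist

N01 = NormalDist()  # standard normal: provides pdf() and cdf()

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def least_squares(X, y, w=None):
    """(Weighted) least squares via the normal equations (X'WX) b = X'Wy."""
    k = len(X[0])
    w = w if w is not None else [1.0] * len(y)
    XtX = [[sum(wi * r[a] * r[b] for r, wi in zip(X, w)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(wi * r[a] * yi for r, yi, wi in zip(X, y, w)) for a in range(k)]
    return solve(XtX, Xty)

def probit(X, s, iters=15):
    """Fit a probit model by Fisher scoring; returns the coefficients."""
    g = [0.0] * len(X[0])
    for _ in range(iters):
        idx = [sum(gj * xj for gj, xj in zip(g, r)) for r in X]
        P = [min(max(N01.cdf(v), 1e-10), 1 - 1e-10) for v in idx]
        d = [max(N01.pdf(v), 1e-12) for v in idx]
        w = [di * di / (p * (1 - p)) for di, p in zip(d, P)]
        resid = [(si - p) / di for si, p, di in zip(s, P, d)]
        step = least_squares(X, resid, w)  # scoring step (X'WX)^-1 X'W resid
        g = [gj + sj for gj, sj in zip(g, step)]
    return g

def heckman_two_step(rows):
    """rows: (x, z, selected, y) tuples; y is used only when selected == 1."""
    # Step 1: probit of selection on a constant, x, and the excluded variable z.
    W = [[1.0, x, z] for x, z, _, _ in rows]
    gamma = probit(W, [s for _, _, s, _ in rows])
    # Step 2: OLS of y on a constant, x, and the inverse Mills ratio (selected rows).
    X2, y2 = [], []
    for (x, z, s, y), w in zip(rows, W):
        if s == 1:
            c = sum(gj * v for gj, v in zip(gamma, w))  # fitted probit index
            X2.append([1.0, x, N01.pdf(c) / N01.cdf(c)])
            y2.append(y)
    return least_squares(X2, y2)

# Simulated data: true outcome equation is y = 1 + 2x + e, with corr(u, e) = rho.
rng = random.Random(42)
rho = 0.7
rows = []
for _ in range(5000):
    x, z, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    e = rho * u + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
    s = 1 if 0.5 * x + z + u > 0 else 0  # y is observed only when s == 1
    rows.append((x, z, s, 1.0 + 2.0 * x + e))

naive = least_squares([[1.0, x] for x, _, s, _ in rows if s],
                      [y for _, _, s, y in rows if s])
heck = heckman_two_step(rows)
print("naive OLS (const, x):     ", [round(b, 2) for b in naive])
print("two-step (const, x, IMR): ", [round(b, 2) for b in heck])
```

In this setup the naive OLS intercept on the selected sample is pushed above its true value of 1, while the two-step estimates should land close to the true intercept and slope, with a positive coefficient on the inverse Mills ratio reflecting the positive error correlation.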

Maximum likelihood estimation

  • Maximum likelihood estimation (MLE) is an alternative approach to correcting for sample selection bias
  • MLE involves specifying a joint distribution for the selection and outcome equations and estimating the parameters of both equations simultaneously by maximizing the likelihood function
  • When the assumed joint distribution is correct, MLE is more efficient than the two-step procedure and provides consistent estimates of the parameters, but it relies more heavily on that distributional assumption and can be more computationally intensive
    • MLE is often used when the selection and outcome equations are believed to have a specific joint distribution, such as a bivariate normal distribution
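Under the bivariate normal assumption, the log-likelihood that MLE maximizes can be written as below. The notation is assumed for illustration: s_i indicates selection, z_i and x_i are the selection and outcome regressors, the selection error has unit variance, the outcome error has standard deviation σ, and ρ is the correlation between the two errors.

```latex
\ell(\beta, \gamma, \sigma, \rho)
  = \sum_{i:\, s_i = 0} \log \Phi\!\left( -z_i'\gamma \right)
  + \sum_{i:\, s_i = 1} \left[
      \log \Phi\!\left(
        \frac{ z_i'\gamma + \rho\,(y_i - x_i'\beta)/\sigma }{ \sqrt{1 - \rho^2} }
      \right)
      - \log \sigma
      + \log \phi\!\left( \frac{ y_i - x_i'\beta }{ \sigma } \right)
    \right]
```

The first sum is the contribution of unselected observations, for which only the selection outcome is known. The second sum combines the density of the observed outcome with the probability of selection conditional on that outcome.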

Inverse probability weighting

  • Inverse probability weighting (IPW) is a method for correcting sample selection bias by reweighting the observed sample to make it representative of the population
  • IPW involves estimating the probability of each individual being included in the sample (the propensity score) based on observed characteristics and then weighting each observation by the inverse of its propensity score
  • Observations with a low probability of being selected receive higher weights, while observations with a high probability of being selected receive lower weights
    • The reweighted sample mimics the distribution of the population, allowing for consistent estimation of the parameters of interest
  • IPW is particularly useful when selection depends only on observable characteristics; unlike the Heckman approach, it does not require specifying a joint distribution for the selection and outcome equations
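The reweighting idea can be sketched with a simulated survey. The setup is hypothetical: a single binary covariate (college) is observed for everyone in the sampling frame, the response probabilities of 0.8 and 0.3 are invented, and the propensity score is estimated as the response rate within each covariate group.

```python
import random
from statistics import mean

rng = random.Random(1)
population = []
for _ in range(20000):
    college = rng.random() < 0.4  # covariate observed for everyone
    income = rng.gauss(70 if college else 40, 10)
    # Selection depends only on the observed covariate (selection on observables).
    responds = rng.random() < (0.8 if college else 0.3)
    population.append((college, income, responds))

def response_rate(group):
    """Estimated propensity score: share of the group that responds."""
    flags = [r for c, _, r in population if c == group]
    return sum(flags) / len(flags)

pscore = {g: response_rate(g) for g in (True, False)}

respondents = [(c, y) for c, y, r in population if r]
naive_mean = mean(y for _, y in respondents)

# Weight each respondent by the inverse of its estimated propensity score.
weights = [1 / pscore[c] for c, _ in respondents]
ipw_mean = sum(w * y for w, (_, y) in zip(weights, respondents)) / sum(weights)

true_mean = mean(y for _, y, _ in population)
print(f"true = {true_mean:.1f}, naive = {naive_mean:.1f}, IPW = {ipw_mean:.1f}")
```

The naive respondent mean overstates average income because college graduates respond more often. The weighted mean restores each group's population share and should sit close to the true mean. With continuous or many covariates, the propensity score would typically come from a logit or probit model instead of group response rates.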

Examples of sample selection bias

Non-response bias in surveys

  • Non-response bias occurs when individuals who refuse to participate in a survey differ systematically from those who do participate
  • For example, in a survey on income, high-income individuals may be less likely to respond due to privacy concerns or time constraints
  • If non-respondents have systematically different incomes than respondents, the estimated average income from the survey will be biased
    • To correct for non-response bias, researchers can use methods such as weighting the sample based on observable characteristics or imputing missing values based on the characteristics of respondents

Incidental truncation in labor economics

  • Incidental truncation arises when the dependent variable is observed only for a subset of the population that is determined by another outcome, such as observing wages only for employed individuals
  • In labor economics, incidental truncation can occur when studying the determinants of wages because wages are only observed for individuals who are employed
  • If the factors that affect an individual's decision to work are correlated with the factors that affect their wage, the estimated wage equation will suffer from sample selection bias
    • The Heckman selection model can be used to correct for incidental truncation by modeling the employment decision and the wage equation simultaneously

Self-selection bias in treatment effects

  • Self-selection bias occurs when individuals self-select into treatment or control groups based on unobserved characteristics that are correlated with the outcome of interest
  • For example, in a study on the effectiveness of a job training program, individuals who choose to participate may have higher motivation or ability than those who do not participate
  • If these unobserved characteristics are positively correlated with employment outcomes, the estimated effect of the training program will be upward biased
    • To correct for self-selection bias, researchers can use methods such as instrumental variables (using a variable that affects participation but affects the outcome only through participation) or propensity score matching (matching treated and control individuals on observable characteristics, which removes bias only to the extent that selection is driven by those observables)
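The instrumental variables idea can be sketched with a simulated training program. The instrument here is a hypothetical randomized offer that raises participation but has no direct effect on earnings. The true effect of 2.0, assumed constant across individuals, and all other numbers are invented for illustration. The estimator is the simple Wald ratio for a binary instrument.

```python
import random
from statistics import mean

rng = random.Random(7)
data = []
for _ in range(40000):
    ability = rng.gauss(0, 1)   # unobserved confounder
    offer = rng.random() < 0.5  # hypothetical randomized instrument
    trains = ability + 1.5 * offer + rng.gauss(0, 1) > 0.75
    earnings = 10 + 3 * ability + 2.0 * trains + rng.gauss(0, 1)
    data.append((offer, trains, earnings))

def group_mean(values, flags, flag):
    """Mean of `values` over observations whose flag equals `flag`."""
    return mean(v for v, f in zip(values, flags) if f == flag)

offers = [o for o, _, _ in data]
trained = [t for _, t, _ in data]
earn = [y for _, _, y in data]

# Naive comparison of participants and non-participants.
naive = group_mean(earn, trained, True) - group_mean(earn, trained, False)

# Wald estimator: effect of the offer on earnings over its effect on participation.
reduced_form = group_mean(earn, offers, True) - group_mean(earn, offers, False)
first_stage = group_mean(trained, offers, True) - group_mean(trained, offers, False)
wald = reduced_form / first_stage

print(f"true effect = 2.0, naive = {naive:.2f}, IV (Wald) = {wald:.2f}")
```

Because the offer is assigned independently of ability, comparing outcomes by offer status removes the ability gap that contaminates the naive comparison, and dividing by the change in participation rescales the result to the effect of training itself.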