2.1 Simple linear regression model

Written by the Fiveable Content Team • Last updated September 2025
Simple linear regression is a powerful tool in econometrics for analyzing relationships between two variables. It models how changes in one variable (independent) affect another (dependent), allowing economists to make predictions and draw insights from data.

This method forms the foundation for more complex regression analyses. By understanding its key components, assumptions, and limitations, students can grasp the basics of statistical modeling and prepare for advanced econometric techniques used in real-world economic research and decision-making.

Definition of simple linear regression

  • Simple linear regression is a statistical method used to model and analyze the linear relationship between two continuous variables in the field of econometrics
  • Establishes a mathematical equation that describes how changes in one variable (the independent variable) are associated with changes in another variable (the dependent variable)
  • Provides a way to make predictions about the dependent variable based on the values of the independent variable

Relationship between two variables

  • Simple linear regression focuses on the relationship between two variables, typically denoted as X (the independent variable) and Y (the dependent variable)
  • The independent variable X is used to explain or predict changes in the dependent variable Y
  • The relationship is assumed to be linear, meaning that each one-unit change in X is associated with the same expected change in Y

Dependent vs independent variables

  • The dependent variable (Y) is the variable that is being explained or predicted in the regression model
  • The independent variable (X) is the variable that is used to explain or predict changes in the dependent variable
  • In econometrics, examples of dependent variables could be consumer spending or GDP growth, while independent variables could be income levels or interest rates

Assumptions of simple linear regression

  • Simple linear regression relies on several key assumptions to ensure the validity and reliability of the model's results
  • Violating these assumptions can lead to biased or inefficient estimates of the regression coefficients and incorrect conclusions

Linearity of relationship

  • The relationship between the independent variable (X) and the dependent variable (Y) is assumed to be linear
  • This means that a one-unit change in X is associated with the same change in Y, regardless of the value of X
  • If the relationship is not linear, the simple linear regression model may not be appropriate, and other regression techniques (polynomial regression) may be needed

Independence of errors

  • The errors (residuals) in the regression model are assumed to be independent of each other
  • This means that the value of one residual does not depend on the values of other residuals
  • Violation of this assumption (autocorrelation) can lead to biased standard errors and incorrect conclusions about the significance of the regression coefficients

Normality of errors

  • The errors in the regression model are assumed to follow a normal distribution with a mean of zero
  • This assumption is important for making valid inferences about the regression coefficients and constructing confidence intervals
  • Non-normality of errors can affect the validity of hypothesis tests and confidence intervals

Homoscedasticity of errors

  • Homoscedasticity assumes that the variance of the errors is constant across all levels of the independent variable
  • In other words, the spread of the residuals should be similar for low and high values of X
  • Heteroscedasticity (non-constant variance) can lead to biased standard errors and affect the efficiency of the OLS estimators

Ordinary least squares (OLS) method

  • OLS is a widely used method for estimating the parameters (coefficients) of a simple linear regression model
  • The goal of OLS is to find the line of best fit that minimizes the sum of squared differences between the observed values of the dependent variable and the values predicted by the regression line

Minimizing sum of squared residuals

  • The OLS method estimates the regression coefficients by minimizing the sum of squared residuals (differences between observed and predicted values)
  • Residuals are calculated as: $e_i = y_i - \hat{y}_i$, where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value
  • The sum of squared residuals is given by: $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
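To make the minimization concrete, here is a minimal Python sketch (the data and variable names are purely illustrative) that computes the residuals and the sum of squared residuals for any candidate intercept and slope; OLS simply chooses the coefficients that make this quantity as small as possible.

```python
import numpy as np

# Illustrative (made-up) data: x is the independent variable, y the dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_squared_residuals(b0, b1, x, y):
    """Sum of squared residuals for the candidate line y_hat = b0 + b1 * x."""
    y_hat = b0 + b1 * x          # predicted values
    e = y - y_hat                # residuals e_i = y_i - y_hat_i
    return np.sum(e ** 2)

# OLS picks the (b0, b1) pair that minimizes this quantity;
# any other candidate line yields a larger value
print(sum_squared_residuals(0.0, 2.0, x, y))
print(sum_squared_residuals(1.0, 1.5, x, y))
```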

Finding line of best fit

  • The line of best fit is the regression line that minimizes the sum of squared residuals
  • The equation for the line of best fit is: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
  • $\hat{\beta}_0$ is the estimated intercept coefficient, and $\hat{\beta}_1$ is the estimated slope coefficient

Estimating regression coefficients

  • The OLS method provides formulas for estimating the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$
  • Slope coefficient: $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
  • Intercept coefficient: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, respectively
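These closed-form formulas translate directly into code. A minimal sketch (again with illustrative data) that applies the two formulas and cross-checks the result against numpy's built-in least-squares fit:

```python
import numpy as np

# Illustrative (made-up) data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by the sum of squared deviations of x
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: forces the fitted line through the point of sample means
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)

# Cross-check: np.polyfit with degree 1 returns [slope, intercept]
print(np.polyfit(x, y, deg=1))
```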

Interpreting regression coefficients

  • The regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ have important interpretations in the context of the simple linear regression model
  • Understanding these coefficients allows researchers to draw meaningful conclusions about the relationship between the independent and dependent variables

Slope coefficient

  • The slope coefficient $\hat{\beta}_1$ represents the change in the dependent variable (Y) associated with a one-unit increase in the independent variable (X), holding other factors constant
  • Interpretation: "For every one-unit increase in X, Y is expected to change by $\hat{\beta}_1$ units, on average"
  • The sign of the slope coefficient indicates the direction of the relationship (positive or negative)

Intercept coefficient

  • The intercept coefficient $\hat{\beta}_0$ represents the expected value of the dependent variable (Y) when the independent variable (X) is equal to zero
  • Interpretation: "When X is zero, the expected value of Y is $\hat{\beta}_0$"
  • In some cases, the intercept may not have a meaningful interpretation, especially if the range of X does not include zero (age or income)

Units of measurement

  • The units of the regression coefficients depend on the units of the independent and dependent variables
  • Slope coefficient: The units of $\hat{\beta}_1$ are the units of Y per unit of X (dollars per year)
  • Intercept coefficient: The units of $\hat{\beta}_0$ are the same as the units of Y (dollars)

Assessing goodness of fit

  • Goodness of fit measures how well the simple linear regression model fits the observed data
  • These measures provide information about the proportion of variation in the dependent variable that is explained by the independent variable

Coefficient of determination (R-squared)

  • R-squared ($R^2$) is a commonly used measure of goodness of fit
  • It represents the proportion of variation in the dependent variable (Y) that is explained by the independent variable (X)
  • Formula: $R^2 = \frac{\text{Explained Sum of Squares (ESS)}}{\text{Total Sum of Squares (TSS)}} = 1 - \frac{\text{Residual Sum of Squares (RSS)}}{\text{Total Sum of Squares (TSS)}}$
  • R-squared ranges from 0 to 1, with higher values indicating a better fit
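A short sketch of the R-squared calculation, reusing the illustrative data and closed-form OLS fit from the earlier examples:

```python
import numpy as np

# Illustrative data and OLS fit (same closed-form formulas as above)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
r_squared = 1 - rss / tss           # share of variation in y explained by x
print(r_squared)
```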

Adjusted R-squared

  • Adjusted R-squared is a modified version of R-squared that accounts for the number of independent variables in the model
  • It penalizes the addition of irrelevant independent variables, which may artificially inflate the R-squared value
  • Formula: $\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$, where $n$ is the sample size and $k$ is the number of independent variables
  • Adjusted R-squared is always lower than or equal to the regular R-squared
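The adjustment itself is a one-line calculation. In the sketch below, the values of r_squared, n, and k are illustrative placeholders; in a simple regression k = 1:

```python
# Illustrative inputs: R-squared from a fitted model, sample size, number of regressors
r_squared = 0.90   # placeholder value
n, k = 50, 1       # simple linear regression has a single regressor

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(adj_r_squared)   # always <= r_squared
```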

Standard error of regression

  • The standard error of regression (SER) measures the average distance between the observed values of the dependent variable and the predicted values from the regression line
  • Formula: $\text{SER} = \sqrt{\frac{\text{RSS}}{n - k - 1}}$, where $\text{RSS}$ is the residual sum of squares, $n$ is the sample size, and $k$ is the number of independent variables
  • A smaller standard error of regression indicates a better fit, as the predicted values are closer to the observed values on average
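The standard error of regression follows the same pattern; in this sketch, rss, n, and k are illustrative placeholders taken from some fitted simple regression:

```python
import math

# Illustrative inputs from a fitted simple regression (placeholder values)
rss = 42.0       # residual sum of squares
n, k = 50, 1     # sample size and number of regressors

ser = math.sqrt(rss / (n - k - 1))   # typical size of a residual, in units of y
print(ser)
```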

Hypothesis testing in simple linear regression

  • Hypothesis testing is used to assess the statistical significance of the estimated regression coefficients
  • It allows researchers to determine whether the observed relationships between the independent and dependent variables are likely to have occurred by chance or if they represent real associations

Null vs alternative hypotheses

  • The null hypothesis ($H_0$) states that there is no significant relationship between the independent variable (X) and the dependent variable (Y)
  • The alternative hypothesis ($H_a$) states that there is a significant relationship between X and Y
  • For the slope coefficient: $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
  • For the intercept coefficient: $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$

t-tests for regression coefficients

  • t-tests are used to test the significance of individual regression coefficients
  • The test statistic is calculated as: $t = \frac{\hat{\beta}_i - \beta_{i,0}}{\text{SE}(\hat{\beta}_i)}$, where $\hat{\beta}_i$ is the estimated coefficient, $\beta_{i,0}$ is the hypothesized value (usually 0), and $\text{SE}(\hat{\beta}_i)$ is the standard error of the coefficient
  • The t-statistic follows a t-distribution with $(n - k - 1)$ degrees of freedom
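A minimal sketch of the t-statistic for the slope in a simple regression, with illustrative data. It uses the standard OLS standard-error formula $\text{SE}(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 / \sum_{i=1}^{n}(x_i - \bar{x})^2}$ with $\hat{\sigma}^2 = \text{RSS}/(n-2)$, which is not stated above but is the usual estimator:

```python
import numpy as np

# Illustrative data and OLS fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.1, 4.8, 5.2, 7.1, 7.9, 9.4, 10.1])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

n = len(y)
sigma2_hat = np.sum(resid ** 2) / (n - 2)                  # estimated error variance
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))  # standard error of the slope

t_stat = (b1 - 0) / se_b1   # test H0: beta_1 = 0
print(t_stat)
```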

p-values and statistical significance

  • The p-value is the probability of observing a test statistic as extreme as the one calculated, assuming the null hypothesis is true
  • A small p-value (typically < 0.05) provides evidence against the null hypothesis and suggests that the coefficient is statistically significant
  • If the p-value is less than the chosen significance level (0.05), the null hypothesis is rejected, and the coefficient is considered statistically significant
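Given a t-statistic and its degrees of freedom, the two-sided p-value comes from the t-distribution. A sketch using scipy, with placeholder numbers:

```python
from scipy import stats

t_stat = 2.45   # illustrative t-statistic
df = 48         # degrees of freedom, n - k - 1

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(p_value, p_value < 0.05)
```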

Confidence intervals for regression coefficients

  • Confidence intervals provide a range of plausible values for the true population regression coefficients
  • They are constructed using the estimated coefficients, their standard errors, and a specified confidence level

Interpreting confidence intervals

  • A 95% confidence interval for a regression coefficient can be interpreted as: "We are 95% confident that the true population coefficient lies within this interval"
  • Formula: $\hat{\beta}_i \pm t_{1-\alpha/2,\, n-k-1} \times \text{SE}(\hat{\beta}_i)$, where $t_{1-\alpha/2,\, n-k-1}$ is the critical value from the t-distribution with $(n - k - 1)$ degrees of freedom and $\alpha$ is the significance level
  • If the confidence interval does not contain zero, the coefficient is considered statistically significant at the chosen confidence level
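A sketch of the interval construction with placeholder inputs; the critical value comes from the t-distribution via scipy:

```python
from scipy import stats

beta_hat = 1.96   # illustrative coefficient estimate
se = 0.40         # illustrative standard error
df = 48           # degrees of freedom, n - k - 1
alpha = 0.05      # for a 95% confidence interval

t_crit = stats.t.ppf(1 - alpha / 2, df)               # critical value
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)
print(ci)   # if 0 lies outside this interval, the coefficient is significant at the 5% level
```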

Confidence level vs interval width

  • The confidence level (e.g., 95%) determines the probability that the true population coefficient lies within the constructed interval
  • A higher confidence level results in a wider interval, while a lower confidence level results in a narrower interval
  • Researchers must balance the desire for high confidence with the need for precise estimates when choosing a confidence level (90%, 95%, or 99%)

Prediction using simple linear regression

  • One of the main uses of simple linear regression is to make predictions about the dependent variable (Y) based on the values of the independent variable (X)
  • Predictions can be made within the range of observed X values (interpolation) or outside the range (extrapolation)

Interpolation vs extrapolation

  • Interpolation involves making predictions for X values that fall within the range of the observed data
  • Extrapolation involves making predictions for X values that are outside the range of the observed data
  • Extrapolation is generally less reliable than interpolation, as the relationship between X and Y may change outside the observed range

Prediction intervals vs confidence intervals

  • Prediction intervals provide a range of plausible values for an individual observation of the dependent variable (Y) given a specific value of the independent variable (X)
  • Confidence intervals, on the other hand, provide a range of plausible values for the mean value of Y given a specific value of X
  • Prediction intervals are always wider than confidence intervals because they account for both the uncertainty in the estimated mean and the variability of individual observations around the mean
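One way to see the difference in practice is with statsmodels, which reports both intervals from the same fitted model. In this sketch the data are simulated and the two evaluation points (2.5 and 7.5) are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

# Simulated (illustrative) data
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=x.size)

results = sm.OLS(y, sm.add_constant(x)).fit()

# Intervals at two illustrative values of x
x_new = sm.add_constant(np.array([2.5, 7.5]))
frame = results.get_prediction(x_new).summary_frame(alpha=0.05)

# mean_ci_*: confidence interval for the mean of y at each x
# obs_ci_*:  prediction interval for an individual y at each x (always wider)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```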

Potential problems in simple linear regression

  • Several issues can arise in simple linear regression that may affect the validity and reliability of the model's results
  • Researchers should be aware of these potential problems and take steps to address them when necessary

Outliers and influential observations

  • Outliers are observations that have unusually large or small values compared to the rest of the data
  • Influential observations are data points that have a disproportionate impact on the estimated regression coefficients
  • Both outliers and influential observations can distort the regression results and lead to misleading conclusions
  • Researchers should identify and carefully examine these observations to determine whether they are valid or if they should be removed or treated differently
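One common way to flag influential observations after an OLS fit is Cook's distance, which statsmodels exposes through its influence diagnostics. A sketch with simulated data and one deliberately distorted point:

```python
import numpy as np
import statsmodels.api as sm

# Simulated (illustrative) data with one deliberately extreme observation
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)
y[-1] += 15.0   # make the last point an outlier

results = sm.OLS(y, sm.add_constant(x)).fit()

# Cook's distance: large values flag observations that pull the fit strongly
cooks_d = results.get_influence().cooks_distance[0]
print(np.argsort(cooks_d)[-3:])   # indices of the three most influential points
```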

Multicollinearity in multiple regression

  • Multicollinearity occurs when there is a high degree of correlation among the independent variables in a multiple regression model
  • While not directly relevant to simple linear regression, multicollinearity can be a problem when extending the model to include multiple independent variables
  • Multicollinearity can lead to unstable and unreliable estimates of the regression coefficients and make it difficult to interpret the individual effects of the independent variables

Heteroscedasticity and non-normality of errors

  • Heteroscedasticity occurs when the variance of the errors is not constant across all levels of the independent variable, violating the assumption of homoscedasticity
  • Non-normality of errors occurs when the errors do not follow a normal distribution, violating the normality assumption
  • Both heteroscedasticity and non-normality can affect the validity of hypothesis tests and confidence intervals
  • Researchers can use diagnostic plots (residual plots) and statistical tests (Breusch-Pagan, White's test) to detect these issues and apply appropriate remedies (weighted least squares, robust standard errors)
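As an illustration, here is a sketch that uses statsmodels to run a Breusch-Pagan test and, as one possible remedy, recompute heteroscedasticity-robust (HC1) standard errors. The data are simulated so that the error spread grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated (illustrative) data with error variance that grows with x
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 100)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5 * x)

results = sm.OLS(y, sm.add_constant(x)).fit()

# Breusch-Pagan test: a small p-value is evidence of heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print(lm_pvalue)

# One common remedy: heteroscedasticity-robust (HC1) standard errors
robust = results.get_robustcov_results(cov_type="HC1")
print(results.bse, robust.bse)   # compare conventional vs robust standard errors
```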

Applications of simple linear regression

  • Simple linear regression is widely used in various fields, including economics, business, and social sciences, to analyze and model relationships between variables
  • It provides a foundation for more advanced regression techniques and helps researchers gain insights into the factors that influence a dependent variable

Examples from economics and business

  • Analyzing the relationship between consumer spending and disposable income
  • Modeling the effect of advertising expenditure on sales revenue
  • Examining the impact of years of education on individual earnings
  • Investigating the relationship between GDP growth and unemployment rates

Limitations of simple linear regression

  • Simple linear regression only considers the relationship between two variables and does not account for the potential influence of other factors
  • The model assumes a linear relationship between the variables, which may not always be appropriate
  • The model is sensitive to outliers and influential observations, which can distort the results
  • Simple linear regression does not establish causality between the variables; it only identifies associations