2.1 Simple linear regression model

Written by the Fiveable Content Team • Last updated September 2025
Simple linear regression is a powerful tool in econometrics for analyzing relationships between two variables. It models how changes in one variable (independent) affect another (dependent), allowing economists to make predictions and draw insights from data.

This method forms the foundation for more complex regression analyses. By understanding its key components, assumptions, and limitations, students can grasp the basics of statistical modeling and prepare for advanced econometric techniques used in real-world economic research and decision-making.

Definition of simple linear regression

  • Simple linear regression is a statistical method used to model and analyze the linear relationship between two continuous variables in the field of econometrics
  • Establishes a mathematical equation that describes how changes in one variable (the independent variable) are associated with changes in another variable (the dependent variable)
  • Provides a way to make predictions about the dependent variable based on the values of the independent variable

Relationship between two variables

  • Simple linear regression focuses on the relationship between two variables, typically denoted as X (the independent variable) and Y (the dependent variable)
  • The independent variable X is used to explain or predict changes in the dependent variable Y
  • The relationship is assumed to be linear, meaning that each one-unit change in X is associated with the same expected change in Y

Dependent vs independent variables

  • The dependent variable (Y) is the variable that is being explained or predicted in the regression model
  • The independent variable (X) is the variable that is used to explain or predict changes in the dependent variable
  • In econometrics, examples of dependent variables could be consumer spending or GDP growth, while independent variables could be income levels or interest rates

Assumptions of simple linear regression

  • Simple linear regression relies on several key assumptions to ensure the validity and reliability of the model's results
  • Violating these assumptions can lead to biased or inefficient estimates of the regression coefficients and incorrect conclusions

Linearity of relationship

  • The relationship between the independent variable (X) and the dependent variable (Y) is assumed to be linear
  • This means that a one-unit change in X is associated with the same change in Y, regardless of the value of X
  • If the relationship is not linear, the simple linear regression model may not be appropriate, and other regression techniques (polynomial regression) may be needed

Independence of errors

  • The errors (residuals) in the regression model are assumed to be independent of each other
  • This means that the value of one residual does not depend on the values of other residuals
  • Violation of this assumption (autocorrelation) can lead to biased standard errors and incorrect conclusions about the significance of the regression coefficients

Normality of errors

  • The errors in the regression model are assumed to follow a normal distribution with a mean of zero
  • This assumption is important for making valid inferences about the regression coefficients and constructing confidence intervals
  • Non-normality of errors can affect the validity of hypothesis tests and confidence intervals

Homoscedasticity of errors

  • Homoscedasticity assumes that the variance of the errors is constant across all levels of the independent variable
  • In other words, the spread of the residuals should be similar for low and high values of X
  • Heteroscedasticity (non-constant variance) can lead to biased standard errors and affect the efficiency of the OLS estimators

Ordinary least squares (OLS) method

  • OLS is a widely used method for estimating the parameters (coefficients) of a simple linear regression model
  • The goal of OLS is to find the line of best fit that minimizes the sum of squared differences between the observed values of the dependent variable and the values predicted by the regression line

Minimizing sum of squared residuals

  • The OLS method estimates the regression coefficients by minimizing the sum of squared residuals (differences between observed and predicted values)
  • Residuals are calculated as: $e_i = y_i - \hat{y}_i$, where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value
  • The sum of squared residuals is given by: $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
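To make the minimization concrete, here is a minimal Python sketch (the data and variable names are purely illustrative) that computes the residuals and the sum of squared residuals for any candidate intercept and slope; OLS simply chooses the coefficients that make this quantity as small as possible.

```python
import numpy as np

# Illustrative (made-up) data: x is the independent variable, y the dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_squared_residuals(b0, b1, x, y):
    """Sum of squared residuals for the candidate line y_hat = b0 + b1 * x."""
    y_hat = b0 + b1 * x          # predicted values
    e = y - y_hat                # residuals e_i = y_i - y_hat_i
    return np.sum(e ** 2)

# OLS picks the (b0, b1) pair that minimizes this quantity;
# any other candidate line yields a larger value
print(sum_squared_residuals(0.0, 2.0, x, y))
print(sum_squared_residuals(1.0, 1.5, x, y))
```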

Finding line of best fit

  • The line of best fit is the regression line that minimizes the sum of squared residuals
  • The equation for the line of best fit is: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
  • $\hat{\beta}_0$ is the estimated intercept coefficient, and $\hat{\beta}_1$ is the estimated slope coefficient

Estimating regression coefficients

  • The OLS method provides formulas for estimating the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$
  • Slope coefficient: $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
  • Intercept coefficient: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, respectively
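These closed-form formulas translate directly into code. A minimal sketch (again with illustrative data) that applies the two formulas and cross-checks the result against numpy's built-in least-squares fit:

```python
import numpy as np

# Illustrative (made-up) data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by the sum of squared deviations of x
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: forces the fitted line through the point of sample means
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)

# Cross-check: np.polyfit with degree 1 returns [slope, intercept]
print(np.polyfit(x, y, deg=1))
```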

Interpreting regression coefficients

  • The regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ have important interpretations in the context of the simple linear regression model
  • Understanding these coefficients allows researchers to draw meaningful conclusions about the relationship between the independent and dependent variables

Slope coefficient

  • The slope coefficient $\hat{\beta}_1$ represents the change in the dependent variable (Y) associated with a one-unit increase in the independent variable (X), holding other factors constant
  • Interpretation: "For every one-unit increase in X, Y is expected to change by $\hat{\beta}_1$ units, on average"
  • The sign of the slope coefficient indicates the direction of the relationship (positive or negative)

Intercept coefficient

  • The intercept coefficient $\hat{\beta}_0$ represents the expected value of the dependent variable (Y) when the independent variable (X) is equal to zero
  • Interpretation: "When X is zero, the expected value of Y is $\hat{\beta}_0$"
  • In some cases, the intercept may not have a meaningful interpretation, especially if the range of X does not include zero (age or income)

Units of measurement

  • The units of the regression coefficients depend on the units of the independent and dependent variables
  • Slope coefficient: The units of $\hat{\beta}_1$ are the units of Y per unit of X (dollars per year)
  • Intercept coefficient: The units of $\hat{\beta}_0$ are the same as the units of Y (dollars)

Assessing goodness of fit

  • Goodness of fit measures how well the simple linear regression model fits the observed data
  • These measures provide information about the proportion of variation in the dependent variable that is explained by the independent variable

Coefficient of determination (R-squared)

  • R-squared ($R^2$) is a commonly used measure of goodness of fit
  • It represents the proportion of variation in the dependent variable (Y) that is explained by the independent variable (X)
  • Formula: $R^2 = \frac{\text{Explained Sum of Squares (ESS)}}{\text{Total Sum of Squares (TSS)}} = 1 - \frac{\text{Residual Sum of Squares (RSS)}}{\text{Total Sum of Squares (TSS)}}$
  • R-squared ranges from 0 to 1, with higher values indicating a better fit
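A short sketch of the R-squared calculation, reusing the illustrative data and closed-form OLS fit from the earlier examples:

```python
import numpy as np

# Illustrative data and OLS fit (same closed-form formulas as above)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
r_squared = 1 - rss / tss           # share of variation in y explained by x
print(r_squared)
```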

Adjusted R-squared

  • Adjusted R-squared is a modified version of R-squared that accounts for the number of independent variables in the model
  • It penalizes the addition of irrelevant independent variables, which may artificially inflate the R-squared value
  • Formula: $\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$, where $n$ is the sample size and $k$ is the number of independent variables
  • Adjusted R-squared is always lower than or equal to the regular R-squared
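The adjustment itself is a one-line calculation. In the sketch below, the values of r_squared, n, and k are illustrative placeholders; in a simple regression k = 1:

```python
# Illustrative inputs: R-squared from a fitted model, sample size, number of regressors
r_squared = 0.90   # placeholder value
n, k = 50, 1       # simple linear regression has a single regressor

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(adj_r_squared)   # always <= r_squared
```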

Standard error of regression

  • The standard error of regression (SER) measures the average distance between the observed values of the dependent variable and the predicted values from the regression line
  • Formula: $\text{SER} = \sqrt{\frac{\text{RSS}}{n - k - 1}}$, where $\text{RSS}$ is the residual sum of squares, $n$ is the sample size, and $k$ is the number of independent variables
  • A smaller standard error of regression indicates a better fit, as the predicted values are closer to the observed values on average
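The standard error of regression follows the same pattern; in this sketch, rss, n, and k are illustrative placeholders taken from some fitted simple regression:

```python
import math

# Illustrative inputs from a fitted simple regression (placeholder values)
rss = 42.0       # residual sum of squares
n, k = 50, 1     # sample size and number of regressors

ser = math.sqrt(rss / (n - k - 1))   # typical size of a residual, in units of y
print(ser)
```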

Hypothesis testing in simple linear regression

  • Hypothesis testing is used to assess the statistical significance of the estimated regression coefficients
  • It allows researchers to determine whether the observed relationships between the independent and dependent variables are likely to have occurred by chance or if they represent real associations

Null vs alternative hypotheses

  • The null hypothesis ($H_0$) states that there is no significant relationship between the independent variable (X) and the dependent variable (Y)
  • The alternative hypothesis ($H_a$) states that there is a significant relationship between X and Y
  • For the slope coefficient: $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
  • For the intercept coefficient: $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$

t-tests for regression coefficients

  • t-tests are used to test the significance of individual regression coefficients
  • The test statistic is calculated as: $t = \frac{\hat{\beta}_i - \beta_{i,0}}{\text{SE}(\hat{\beta}_i)}$, where $\hat{\beta}_i$ is the estimated coefficient, $\beta_{i,0}$ is the hypothesized value (usually 0), and $\text{SE}(\hat{\beta}_i)$ is the standard error of the coefficient
  • The t-statistic follows a t-distribution with $(n - k - 1)$ degrees of freedom
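A minimal sketch of the t-statistic for the slope in a simple regression, with illustrative data. It uses the standard OLS standard-error formula $\text{SE}(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 / \sum_{i=1}^{n}(x_i - \bar{x})^2}$ with $\hat{\sigma}^2 = \text{RSS}/(n-2)$, which is not stated above but is the usual estimator:

```python
import numpy as np

# Illustrative data and OLS fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.1, 4.8, 5.2, 7.1, 7.9, 9.4, 10.1])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

n = len(y)
sigma2_hat = np.sum(resid ** 2) / (n - 2)                  # estimated error variance
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))  # standard error of the slope

t_stat = (b1 - 0) / se_b1   # test H0: beta_1 = 0
print(t_stat)
```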

p-values and statistical significance

  • The p-value is the probability of observing a test statistic as extreme as the one calculated, assuming the null hypothesis is true
  • A small p-value (typically < 0.05) provides evidence against the null hypothesis and suggests that the coefficient is statistically significant
  • If the p-value is less than the chosen significance level (0.05), the null hypothesis is rejected, and the coefficient is considered statistically significant
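Given a t-statistic and its degrees of freedom, the two-sided p-value comes from the t-distribution. A sketch using scipy, with placeholder numbers:

```python
from scipy import stats

t_stat = 2.45   # illustrative t-statistic
df = 48         # degrees of freedom, n - k - 1

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(p_value, p_value < 0.05)
```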

Confidence intervals for regression coefficients

  • Confidence intervals provide a range of plausible values for the true population regression coefficients
  • They are constructed using the estimated coefficients, their standard errors, and a specified confidence level

Interpreting confidence intervals

  • A 95% confidence interval for a regression coefficient can be interpreted as: "We are 95% confident that the true population coefficient lies within this interval"
  • Formula: $\hat{\beta}_i \pm t_{1-\alpha/2,\, n-k-1} \times \text{SE}(\hat{\beta}_i)$, where $t_{1-\alpha/2,\, n-k-1}$ is the critical value from the t-distribution with $(n - k - 1)$ degrees of freedom and $\alpha$ is the significance level
  • If the confidence interval does not contain zero, the coefficient is considered statistically significant at the chosen confidence level
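A sketch of the interval construction with placeholder inputs; the critical value comes from the t-distribution via scipy:

```python
from scipy import stats

beta_hat = 1.96   # illustrative coefficient estimate
se = 0.40         # illustrative standard error
df = 48           # degrees of freedom, n - k - 1
alpha = 0.05      # for a 95% confidence interval

t_crit = stats.t.ppf(1 - alpha / 2, df)               # critical value
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)
print(ci)   # if 0 lies outside this interval, the coefficient is significant at the 5% level
```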

Confidence level vs interval width

  • The confidence level (e.g., 95%) determines the probability that the true population coefficient lies within the constructed interval
  • A higher confidence level results in a wider interval, while a lower confidence level results in a narrower interval
  • Researchers must balance the desire for high confidence with the need for precise estimates when choosing a confidence level (90%, 95%, or 99%)

Prediction using simple linear regression

  • One of the main uses of simple linear regression is to make predictions about the dependent variable (Y) based on the values of the independent variable (X)
  • Predictions can be made within the range of observed X values (interpolation) or outside the range (extrapolation)

Interpolation vs extrapolation

  • Interpolation involves making predictions for X values that fall within the range of the observed data
  • Extrapolation involves making predictions for X values that are outside the range of the observed data
  • Extrapolation is generally less reliable than interpolation, as the relationship between X and Y may change outside the observed range

Prediction intervals vs confidence intervals

  • Prediction intervals provide a range of plausible values for an individual observation of the dependent variable (Y) given a specific value of the independent variable (X)
  • Confidence intervals, on the other hand, provide a range of plausible values for the mean value of Y given a specific value of X
  • Prediction intervals are always wider than confidence intervals because they account for both the uncertainty in the estimated mean and the variability of individual observations around the mean
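One way to see the difference in practice is with statsmodels, which reports both intervals from the same fitted model. In this sketch the data are simulated and the two evaluation points (2.5 and 7.5) are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

# Simulated (illustrative) data
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=x.size)

results = sm.OLS(y, sm.add_constant(x)).fit()

# Intervals at two illustrative values of x
x_new = sm.add_constant(np.array([2.5, 7.5]))
frame = results.get_prediction(x_new).summary_frame(alpha=0.05)

# mean_ci_*: confidence interval for the mean of y at each x
# obs_ci_*:  prediction interval for an individual y at each x (always wider)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```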

Potential problems in simple linear regression

  • Several issues can arise in simple linear regression that may affect the validity and reliability of the model's results
  • Researchers should be aware of these potential problems and take steps to address them when necessary

Outliers and influential observations

  • Outliers are observations that have unusually large or small values compared to the rest of the data
  • Influential observations are data points that have a disproportionate impact on the estimated regression coefficients
  • Both outliers and influential observations can distort the regression results and lead to misleading conclusions
  • Researchers should identify and carefully examine these observations to determine whether they are valid or if they should be removed or treated differently
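One common way to flag influential observations after an OLS fit is Cook's distance, which statsmodels exposes through its influence diagnostics. A sketch with simulated data and one deliberately distorted point:

```python
import numpy as np
import statsmodels.api as sm

# Simulated (illustrative) data with one deliberately extreme observation
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)
y[-1] += 15.0   # make the last point an outlier

results = sm.OLS(y, sm.add_constant(x)).fit()

# Cook's distance: large values flag observations that pull the fit strongly
cooks_d = results.get_influence().cooks_distance[0]
print(np.argsort(cooks_d)[-3:])   # indices of the three most influential points
```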

Multicollinearity in multiple regression

  • Multicollinearity occurs when there is a high degree of correlation among the independent variables in a multiple regression model
  • While not directly relevant to simple linear regression, multicollinearity can be a problem when extending the model to include multiple independent variables
  • Multicollinearity can lead to unstable and unreliable estimates of the regression coefficients and make it difficult to interpret the individual effects of the independent variables

Heteroscedasticity and non-normality of errors

  • Heteroscedasticity occurs when the variance of the errors is not constant across all levels of the independent variable, violating the assumption of homoscedasticity
  • Non-normality of errors occurs when the errors do not follow a normal distribution, violating the normality assumption
  • Both heteroscedasticity and non-normality can affect the validity of hypothesis tests and confidence intervals
  • Researchers can use diagnostic plots (residual plots) and statistical tests (Breusch-Pagan, White's test) to detect these issues and apply appropriate remedies (weighted least squares, robust standard errors)
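As an illustration, here is a sketch that uses statsmodels to run a Breusch-Pagan test and, as one possible remedy, recompute heteroscedasticity-robust (HC1) standard errors. The data are simulated so that the error spread grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated (illustrative) data with error variance that grows with x
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 100)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5 * x)

results = sm.OLS(y, sm.add_constant(x)).fit()

# Breusch-Pagan test: a small p-value is evidence of heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print(lm_pvalue)

# One common remedy: heteroscedasticity-robust (HC1) standard errors
robust = results.get_robustcov_results(cov_type="HC1")
print(results.bse, robust.bse)   # compare conventional vs robust standard errors
```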

Applications of simple linear regression

  • Simple linear regression is widely used in various fields, including economics, business, and social sciences, to analyze and model relationships between variables
  • It provides a foundation for more advanced regression techniques and helps researchers gain insights into the factors that influence a dependent variable

Examples from economics and business

  • Analyzing the relationship between consumer spending and disposable income
  • Modeling the effect of advertising expenditure on sales revenue
  • Examining the impact of years of education on individual earnings
  • Investigating the relationship between GDP growth and unemployment rates

Limitations of simple linear regression

  • Simple linear regression only considers the relationship between two variables and does not account for the potential influence of other factors
  • The model assumes a linear relationship between the variables, which may not always be appropriate
  • The model is sensitive to outliers and influential observations, which can distort the results
  • Simple linear regression does not establish causality between the variables; it only identifies associations