10.3 Simple linear regression model

Written by the Fiveable Content Team • Last updated September 2025

Simple linear regression models the relationship between two continuous variables, assuming that relationship is linear. It's a fundamental tool for predicting and analyzing how changes in one variable are associated with changes in another.

This technique uses an explanatory variable to predict a response variable. By estimating a line of best fit, interpreting its slope and y-intercept, and assessing model fit, we can understand and quantify the relationship between variables.

Simple linear regression

  • Fundamental statistical technique used to model and analyze the relationship between two continuous variables
  • Assumes a linear relationship exists between the explanatory variable (X) and the response variable (Y)
  • Enables predictions and inferences about the response variable based on the values of the explanatory variable
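
In symbols, the model assumes each response value is generated by a straight-line relationship plus random error:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $\varepsilon$ is a random error term with mean zero.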

Relationship between variables

  • Examines how changes in one variable are associated with changes in another variable
  • Determines the strength and direction of the relationship between the explanatory and response variables

Explanatory vs response variables

  • Explanatory variable (X) is the independent variable that is used to explain or predict changes in the response variable
  • Response variable (Y) is the dependent variable that is being explained or predicted by the explanatory variable
  • Example: In a study of the relationship between study time (X) and exam scores (Y), study time is the explanatory variable, and exam scores are the response variable

Scatterplots

  • Graphical representation of the relationship between two continuous variables
  • Each data point represents a pair of values for the explanatory and response variables
  • Scatterplots help visualize the strength, direction, and shape of the relationship between variables
  • Example: A scatterplot of height (X) and weight (Y) data points can reveal a positive linear relationship
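
As a quick illustration, here is a minimal matplotlib sketch of the height-and-weight example; the data values are hypothetical and chosen only to show a positive linear pattern.

```python
import matplotlib.pyplot as plt

# Hypothetical height (cm) and weight (kg) pairs
height = [150, 158, 163, 170, 175, 183]
weight = [50, 55, 61, 66, 72, 80]

plt.scatter(height, weight)
plt.xlabel("Height (cm)")   # explanatory variable (X)
plt.ylabel("Weight (kg)")   # response variable (Y)
plt.title("Height vs. weight: a positive linear pattern")
plt.show()
```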

Line of best fit

  • Represents the linear model that best describes the relationship between the explanatory and response variables
  • Determined by finding the line that minimizes the sum of squared residuals (differences between observed and predicted values)

Slope and y-intercept

  • Slope ($\beta_1$) represents the change in the response variable (Y) for a one-unit increase in the explanatory variable (X)
    • Its sign indicates the direction of the relationship (positive or negative); its magnitude is the steepness of the line, in units of Y per unit of X
  • Y-intercept ($\beta_0$) represents the predicted value of the response variable when the explanatory variable is zero
    • Provides a reference point for the line of best fit, though it may not be meaningful when zero lies outside the observed range of X

Residuals and errors

  • Residuals are the differences between the observed values of the response variable and the values predicted by the regression line
  • Errors are the corresponding deviations from the true population regression line; residuals serve as sample estimates of the errors
  • Smaller residuals indicate a better fit of the model to the data
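
For the $i$-th observation, the residual is the vertical distance between the observed value and the fitted value on the line:

$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$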

Least squares method

  • Statistical method used to estimate the parameters (slope and y-intercept) of the line of best fit
  • Minimizes the sum of squared residuals to find the optimal line that best fits the data

Minimizing sum of squared residuals

  • The line of best fit is chosen by finding the values of the slope and y-intercept that minimize the sum of squared residuals
  • Squaring the residuals ensures that positive and negative residuals do not cancel each other out
  • Under the standard model assumptions, minimizing the sum of squared residuals yields the best linear unbiased estimates of the model parameters (Gauss-Markov theorem)
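
For simple linear regression this minimization has a closed-form solution, so the estimates can be computed directly. A minimal NumPy sketch, using hypothetical study-time and exam-score data:

```python
import numpy as np

# Hypothetical data: study hours (x) and exam scores (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([52.0, 58.0, 61.0, 68.0, 70.0, 77.0])

# Closed-form least-squares estimates:
#   b1 = sum((x - xbar) * (y - ybar)) / sum((x - xbar)**2)
#   b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

residuals = y - (b0 + b1 * x)
print(f"slope = {b1:.3f}, intercept = {b0:.3f}")
print(f"sum of squared residuals = {np.sum(residuals ** 2):.3f}")
```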

Assessing model fit

  • Evaluating how well the linear regression model fits the observed data
  • Determines the proportion of variability in the response variable that is explained by the explanatory variable

Coefficient of determination (R-squared)

  • Measures the proportion of variability in the response variable that is explained by the linear regression model
  • Ranges from 0 to 1, with higher values indicating a better fit of the model to the data
  • Calculated as the ratio of the explained variance to the total variance
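
In terms of sums of squares, where $SS_{res}$ is the sum of squared residuals, $SS_{reg}$ is the regression (explained) sum of squares, and $SS_{tot}$ is the total sum of squares of Y about its mean:

$$R^2 = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}$$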

Interpretation of R-squared

  • An R-squared value of 0 indicates that the linear model does not explain any of the variability in the response variable
  • An R-squared value of 1 indicates that the linear model perfectly explains all the variability in the response variable
  • Example: An R-squared value of 0.75 means that 75% of the variability in the response variable is explained by the linear model

Correlation coefficient

  • Measures the strength and direction of the linear relationship between two continuous variables
  • Ranges from -1 to 1, with values closer to -1 or 1 indicating a stronger linear relationship

Pearson correlation coefficient

  • Most commonly used correlation coefficient for simple linear regression
  • Measures the strength and direction of the linear relationship between the explanatory and response variables
  • Calculated using the covariance of the two variables divided by the product of their standard deviations
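
Written out, with $s_X$ and $s_Y$ the sample standard deviations:

$$r = \frac{\text{cov}(X, Y)}{s_X s_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

In simple linear regression, $R^2$ equals the square of this correlation coefficient.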

Interpretation of correlation

  • A correlation coefficient of 1 indicates a perfect positive linear relationship
  • A correlation coefficient of -1 indicates a perfect negative linear relationship
  • A correlation coefficient of 0 indicates no linear relationship between the variables
  • Example: A correlation coefficient of 0.8 suggests a strong positive linear relationship between the variables

Hypothesis tests

  • Statistical procedures used to test the significance of the relationship between the explanatory and response variables
  • Determine whether the observed relationship is likely to have occurred by chance or if it represents a true relationship in the population

Significance of slope

  • Tests the null hypothesis that the slope of the regression line is equal to zero (no linear relationship)
  • If the p-value is less than the chosen significance level (e.g., 0.05), the null hypothesis is rejected, indicating a statistically significant linear relationship
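
In practice this t-test is reported by standard regression routines. A minimal sketch using scipy.stats.linregress on hypothetical data; linregress returns the two-sided p-value for the null hypothesis of zero slope:

```python
import numpy as np
from scipy import stats

# Hypothetical data: study hours (x) and exam scores (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([52.0, 58.0, 61.0, 68.0, 70.0, 77.0])

# linregress fits the least-squares line and reports the two-sided
# p-value for the t-test of H0: slope = 0
result = stats.linregress(x, y)
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Reject H0: evidence of a linear relationship")
```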

Confidence intervals

  • Provide a range of plausible values for the slope and y-intercept of the regression line
  • Indicate the precision and uncertainty associated with the estimated model parameters
  • Example: A 95% confidence interval for the slope of (0.5, 1.2) means we are 95% confident that the true slope lies between 0.5 and 1.2; in repeated sampling, about 95% of intervals constructed this way would capture the true slope
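
The interval for the slope takes the usual form of estimate plus or minus a t-critical value times the standard error, with $n - 2$ degrees of freedom:

$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot SE(\hat{\beta}_1)$$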

Checking model assumptions

  • Verifying that the assumptions underlying simple linear regression are met to ensure the validity of the model and its inferences
  • Violations of assumptions can lead to biased or unreliable results

Linearity

  • The relationship between the explanatory and response variables should be linear
  • Scatterplots can be used to visually assess linearity
  • Residual plots (residuals vs. explanatory variable) can also help detect non-linearity

Independence of errors

  • The errors (residuals) should be independent of each other
  • Violations can occur when data points are collected over time or have a spatial relationship
  • Durbin-Watson test can be used to assess the independence of errors

Constant variance of errors

  • The variability of the errors should be constant across all levels of the explanatory variable (homoscedasticity)
  • Non-constant variance (heteroscedasticity) can be detected using residual plots (residuals vs. fitted values)

Normality of errors

  • The errors should be normally distributed with a mean of zero
  • Normal probability plots or histograms of the residuals can be used to assess normality
  • Shapiro-Wilk or Kolmogorov-Smirnov tests can formally test for normality of errors
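
A minimal diagnostics sketch, assuming scipy and statsmodels are installed, that checks two of these assumptions on the residuals of a hypothetical fit: the Shapiro-Wilk test for normality and the Durbin-Watson statistic for independence (mentioned above):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Residuals from a hypothetical simple linear regression fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([52.0, 58.0, 61.0, 68.0, 70.0, 77.0])
fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Shapiro-Wilk: H0 is that the residuals are normally distributed
w_stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value = {p_value:.3f}")

# Durbin-Watson: values near 2 suggest no first-order autocorrelation
print(f"Durbin-Watson statistic = {durbin_watson(residuals):.3f}")
```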

Outliers and influential points

  • Observations that deviate substantially from the overall pattern of the data or have a disproportionate impact on the regression model
  • Can affect the estimates of the model parameters and the goodness of fit

Identifying outliers

  • Outliers can be identified using scatterplots or residual plots
  • Points that are far from the majority of the data or have large residuals may be considered outliers
  • Example: In a scatterplot of height and weight, a data point with a height of 200 cm and a weight of 50 kg would be an outlier

Leverage and influence

  • Leverage measures the distance of an observation from the mean of the explanatory variable
  • High leverage points can have a strong influence on the regression line
  • Influence measures the impact of an observation on the model parameters or fitted values
  • Cook's distance is a measure that combines leverage and residuals to assess the overall influence of an observation
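
A short sketch, assuming statsmodels, that computes Cook's distance for every observation; the data are hypothetical, with the last point placed far from the others so that it has high leverage:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data; the last x-value is far from the rest, giving
# that observation high leverage
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
y = np.array([52.0, 58.0, 61.0, 68.0, 70.0, 95.0])

model = sm.OLS(y, sm.add_constant(x)).fit()

# cooks_distance returns the distances and their p-values; large
# distances flag observations with outsized influence on the fit
cooks_d, _ = model.get_influence().cooks_distance
print(np.round(cooks_d, 3))
```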

Predictions using regression model

  • Using the estimated regression equation to predict the value of the response variable for a given value of the explanatory variable
  • Allows for interpolation and extrapolation based on the observed data

Interpolation vs extrapolation

  • Interpolation involves making predictions within the range of the observed explanatory variable values
  • Extrapolation involves making predictions beyond the range of the observed explanatory variable values
  • Extrapolation carries more uncertainty and should be done with caution
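
A minimal sketch of prediction from a fitted line; the coefficients are hypothetical, assumed to come from an earlier fit where the observed x-values ranged from 1 to 6:

```python
# Hypothetical fitted coefficients from an earlier least-squares fit,
# where the observed x-values ranged from 1 to 6
b0, b1 = 47.5, 4.8

def predict(x_new):
    """Predicted response for a given explanatory value."""
    return b0 + b1 * x_new

print(predict(3.5))   # interpolation: 3.5 lies inside the observed range
print(predict(20.0))  # extrapolation: far outside the range; use caution
```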

Limitations of simple linear regression

  • Assumes a linear relationship between the explanatory and response variables, which may not always be appropriate
  • Does not account for the influence of other variables that may affect the response variable
  • Sensitive to outliers and influential points, which can distort the model estimates
  • Limited to modeling the relationship between two continuous variables
  • Causal inferences cannot be made solely based on the regression results, as correlation does not imply causation