2.7 Residuals

5 min read • December 29, 2022

Athena_Codes

Jed Quiaoit

When evaluating the effectiveness of a linear regression model, we use residuals.

What is a residual, then?

Residuals are a measure of how well a linear regression model fits the data. They are the differences between the observed values of the response variable (y) and the predicted values (ŷ) from the model, so residual = y - ŷ.

In a linear regression model, the goal is to find the line of best fit that minimizes the sum of the squared residuals. This is known as the least squares criterion. The residual for each point represents the signed vertical distance between the point and the line of best fit. If a point has a small residual (close to 0), the model predicts the value of the response variable well for that point. If a point has a large residual (far from 0), the model does not predict the value of the response variable well for that point.
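To make the least squares criterion concrete, here is a minimal sketch in plain Python. The data points and the comparison line are made up for illustration, and the function names `fit_lsrl` and `sum_squared_residuals` are our own, not part of any library. It fits the line using the standard slope and intercept formulas and then checks that another candidate line has a larger sum of squared residuals.

```python
def fit_lsrl(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The LSRL always passes through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up (actual - predicted)^2 over every data point."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_lsrl(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
# Any other line, such as y-hat = 0.5x + 2.5, should do no better.
other = sum_squared_residuals(xs, ys, 0.5, 2.5)

print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("SSR for LSRL:", round(best, 2), "SSR for other line:", round(other, 2))
```

For this made-up data the fitted line works out to roughly ŷ = 0.6x + 2.2, and its sum of squared residuals is smaller than that of the comparison line.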

Here's another way to think about it, this time in terms of "positive" and "negative" residuals: if we have a positive residual, then the actual value is greater than the predicted value and we say that the model underestimates the true value by that amount. Likewise, if we have a negative residual, then the actual value is less than the predicted value and we say that the model overestimates the true value by that amount.
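The sign rule can be sketched as a small helper. The function name `describe_residual` and the sample numbers are hypothetical and used only for illustration.

```python
def describe_residual(actual, predicted):
    """Return the residual and whether the model under- or overestimates."""
    residual = actual - predicted
    if residual > 0:
        # Actual is above the line: the prediction was too low.
        return residual, "underestimate"
    if residual < 0:
        # Actual is below the line: the prediction was too high.
        return residual, "overestimate"
    return residual, "exact"


print(describe_residual(10, 8))   # positive residual
print(describe_residual(8, 10))   # negative residual
```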

Residual Plots

A residual plot is a graph that plots the residuals (the differences between the observed values and the predicted values) on the vertical axis and the predictor or explanatory variable on the horizontal axis.

If the residual plot for a linear regression model exhibits apparent randomness, it can be taken as evidence that the relationship between the predictor and response variables is linear. In other words, it suggests that the model is capturing the underlying relationship in the data correctly.

"Apparent randomness" means that the residuals are randomly dispersed around the horizontal axis. This indicates that the model is fitting the data well and that the residuals are not systematically related to the predictor variable.

Here are two examples of scatterplots with linear regression models (and also their residual plots).

Example 1

In the example below, we can see that our LSRL fits our data fairly well (scatterplot on left). Therefore, the residual plot (on right) seems to show no apparent pattern. Our red points seem equally scattered about the red line at 0.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2FScreen%20Shot%202020-04-25%20at%2011.54-eRh0y9LKXEtA.png?alt=media&token=963287b4-2f4a-46b5-8f97-7265ab19aaef

Courtesy of Starnes, Daren S. and Tabor, Josh. The Practice of Statistics—For the AP Exam, 5th Edition. Cengage Publishing.

Example 2

In this data, we can clearly see that our data follows a curved pattern, not the linear model pictured (scatterplot on left). Therefore, our residual plot (on right) shows an apparent curved pattern. We will learn more about these types of models, and how to adjust them to create a linear model, in Unit 2.9.

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2FScreen%20Shot%202020-04-25%20at%2011.55-4qQSeMSW86Hi.png?alt=media&token=10c978a8-811c-4084-b843-7b8329135091

Courtesy of Starnes, Daren S. and Tabor, Josh. The Practice of Statistics—For the AP Exam, 5th Edition. Cengage Publishing.

Good or Bad?

How do we tell whether a model is good? Look at the residual plot. For a good model, the residuals should be randomly scattered and have no clear pattern, as in the first set above. In the second set, there is a distinct curve in the residual plot, meaning that a linear regression model is not appropriate for the data and a nonlinear model would be best.
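As a rough illustration of what a "distinct curve" in the residuals looks like numerically, the sketch below fits a line to made-up data that follows y = x² and lists the residuals in order of x. The data and the helper name `fit_line` are assumptions for this example only.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical curved data: y is exactly x squared.
curved_xs = [0, 1, 2, 3, 4, 5, 6]
curved_ys = [x ** 2 for x in curved_xs]

b, a = fit_line(curved_xs, curved_ys)
# Residual = actual - predicted, listed in order of x.
curved_residuals = [y - (a + b * x) for x, y in zip(curved_xs, curved_ys)]

for x, r in zip(curved_xs, curved_residuals):
    print(x, round(r, 2))
```

The residuals start positive, dip negative in the middle, and end positive again. Plotted against x, that is a U shape rather than a random scatter around 0, which is the signal that a straight line is the wrong form for this data.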

Calculating Residuals

In order to calculate a residual for a given data point, we need the LSRL for that data set and the given data point.

We will first calculate the predicted value using the LSRL. Then, we subtract the predicted value from the actual value in the given data point. In other words, our formula is Residual = (Actual)-(Predicted).

Example 1

An LSRL model for the predicted amount of Lucky Charms eaten in accordance with one's age in years is given by the equation below:

ŷ=150.5x-2.34

A 50-year-old from our data set is said to have eaten 7,500 Lucky Charms in his life! Wow! I hope he found the 💰 at the end of the 🌈! Calculate the residual for his number.

ŷ = 150.5(50) - 2.34

ŷ = 7522.66

Residual is 7500 - 7522.66 = -22.66. Because the residual is negative, the model overestimates his actual total by 22.66 Lucky Charms.
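The same calculation can be checked with a few lines of Python. The function name `predict_charms` is our own label for the LSRL equation given above.

```python
def predict_charms(age):
    """Predicted Lucky Charms eaten, from the LSRL y-hat = 150.5x - 2.34."""
    return 150.5 * age - 2.34


actual = 7500
predicted = predict_charms(50)
# Residual = (Actual) - (Predicted)
residual = actual - predicted

print("predicted:", round(predicted, 2))
print("residual:", round(residual, 2))
```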

Keep in mind that sometimes you may be asked to calculate one's actual data point (or predicted data point) when given the residual. This would require the same formula, but working backwards.
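Working backwards just means rearranging Residual = (Actual) - (Predicted) into Actual = (Predicted) + (Residual). Here is a minimal sketch using the Lucky Charms LSRL; the 20-year-old and the residual of 12.5 are made-up values for illustration only.

```python
def predict_charms(age):
    """Predicted Lucky Charms eaten, from the LSRL y-hat = 150.5x - 2.34."""
    return 150.5 * age - 2.34


# Hypothetical question: a 20-year-old has a residual of 12.5.
# What was the actual number eaten?
given_residual = 12.5
predicted_20 = predict_charms(20)
# Actual = Predicted + Residual
actual_20 = predicted_20 + given_residual

print("predicted:", round(predicted_20, 2))
print("actual:", round(actual_20, 2))
```

Going the other direction, if the actual value and the residual are given, the predicted value is Actual - Residual.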

Example 2

A researcher is studying the relationship between the number of hours spent studying for an exam and the score received on the exam. She collects data from 50 students and fits a linear regression model to the data. The residual plot for the model is shown below:

https://firebasestorage.googleapis.com/v0/b/fiveable-92889.appspot.com/o/images%2F-Vbw6NkQuXl3R.png?alt=media&token=0f25533c-5382-4e83-bf0a-aa6894827da7

a) Describe the pattern, if any, in the residual plot.

b) Explain what the pattern in the residual plot suggests about the fit of the model.

c) If the model is not fitting the data well, suggest one potential reason why this may be the case.

d) Assuming that the model is not fitting the data well, propose one potential solution to improve the fit of the model.

e) Explain how the solution you proposed in part (d) would address the issue with the model.

Answers

a) The residual plot exhibits a curved pattern.

b) The pattern in the residual plot suggests that the fit of the model is not good. The residuals are not randomly dispersed around the horizontal axis, indicating that there is a systematic relationship between the predictor and response variables that is not being captured by the model.

c) One potential reason why the model may not be fitting the data well is that the relationship between the number of hours spent studying and the exam score is not linear. There may be some other underlying relationship between the variables that is not being captured by the model.

We'll learn more about d) and e) in future sections!

d) One potential solution to improve the fit of the model would be to transform the data in some way, such as by taking the logarithm of the number of hours spent studying or the exam score.

e) The solution proposed in part (d) would address the issue with the model by allowing the relationship between the predictor and response variables to be more accurately captured. A transformation may be able to uncover a more appropriate functional form for the relationship between the variables, leading to a better fit of the model.
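To see how a transformation like the one in part (d) could help, here is a minimal sketch with made-up data in which the score is exactly 50 + 10·ln(hours). These numbers are not the researcher's data; they are an assumption chosen so the effect is easy to see. We fit a line to the raw hours, then to ln(hours), and compare the residuals.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def get_residuals(xs, ys):
    """Fit a line to (xs, ys) and return actual - predicted for each point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data: score grows with the logarithm of hours studied.
hours = [1, 2, 4, 8, 16]
scores = [50 + 10 * math.log(h) for h in hours]

raw_residuals = get_residuals(hours, scores)
# Transform the predictor, then fit a line to (ln(hours), score).
log_hours = [math.log(h) for h in hours]
log_residuals = get_residuals(log_hours, scores)

print("raw:", [round(r, 2) for r in raw_residuals])
print("after log transform:", [round(r, 2) for r in log_residuals])
```

With the raw hours, the residuals are negative at both ends and positive in the middle, which is a curved pattern. After the transformation they are essentially 0, because the relationship between ln(hours) and score is linear in this constructed example. Real data would not come out this clean, but the residual plot should look much closer to random scatter if the transformation is a good choice.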

🎥Watch: AP Stats - Least Squares Regression Lines

Key Terms to Review (11)

Least squares criterion

: The least squares criterion is used to find an equation (usually linear) that minimizes the sum of squared differences between observed and predicted values.

Linear Regression Model

: A linear regression model is a statistical approach used to model and analyze relationships between two variables, where one variable (dependent variable) can be predicted based on another variable (independent variable). It assumes that there exists a linear relationship between these variables.

LSRL (Least Squares Regression Line)

: The LSRL, also known as the least squares regression line, is a straight line that best represents the relationship between two variables by minimizing the sum of squared residuals. It is commonly used to predict values based on observed data points.

Nonlinear Model

: A nonlinear model is a statistical model that does not follow a straight line relationship between the independent and dependent variables. It involves curves, exponential functions, or other non-linear patterns.

Predicted Values

: Predicted values are the estimated values of the response variable based on a regression model. They are calculated using the regression equation and the given predictor variables.

Predictor Variable

: A predictor variable, also known as an independent variable, is a variable that is used to predict or explain the values of the response variable.

Randomness

: Randomness refers to an unpredictable, haphazard pattern with no systematic structure. In a residual plot, it means the residuals show no trend or curve. It also plays a crucial role in statistical experiments and sampling techniques.

Residual Plot

: A residual plot shows the differences between observed and predicted values in regression analysis. It helps identify patterns or trends in these differences, indicating whether linear regression assumptions are met.

Residuals

: Residuals are the differences between observed values and predicted values in a regression analysis. They represent the vertical distances between data points and the least-squares regression line.

Response Variable

: The response variable is the outcome or result that researchers measure and analyze in an experiment. It represents the effect or output of interest.

Scatterplot

: A scatterplot is a graph that displays the relationship between two quantitative variables. It uses dots to represent individual data points and shows how they are distributed along the x and y axes.


© 2024 Fiveable Inc. All rights reserved.

AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

