Probability and Statistics Unit 10 Review

10.1 Pearson correlation coefficient

Written by the Fiveable Content Team • Last updated September 2025

The Pearson correlation coefficient measures the strength and direction of linear relationships between two continuous variables. It ranges from -1 to +1, with values closer to the extremes indicating stronger relationships. This statistical tool helps researchers quantify and interpret connections between variables.

Using Pearson correlation involves several assumptions, including linearity and approximately normal distributions. The coefficient is computed from the covariance of the two variables and their standard deviations. Hypothesis testing determines whether an observed correlation is statistically significant, helping researchers draw meaningful conclusions from data analysis.

Definition of Pearson correlation coefficient

  • Pearson correlation coefficient measures the linear relationship between two continuous variables
  • Denoted by the symbol $r$, it ranges from -1 to +1
  • Values closer to -1 or +1 indicate a stronger linear relationship, while values closer to 0 suggest a weaker or no linear relationship
  • Positive $r$ values indicate a direct relationship (as one variable increases, the other also increases), while negative $r$ values indicate an inverse relationship (as one variable increases, the other decreases)

Assumptions for using Pearson correlation

  • Both variables must be continuous and measured on an interval or ratio scale
  • The relationship between the variables should be linear
  • There should be no significant outliers in the data
  • The variables should be approximately normally distributed
  • Homoscedasticity should hold: the variability in one variable should be similar across all values of the other variable

Formula for calculating Pearson correlation

  • The Pearson correlation coefficient is calculated using the following formula: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
  • $x_i$ and $y_i$ represent individual data points, $\bar{x}$ and $\bar{y}$ represent the means of the respective variables, and $n$ is the number of data points
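
To make the formula concrete, here is a minimal sketch in Python that computes $r$ directly from the definition above; the study-hours and exam-score numbers are invented purely for illustration.

```python
import math

def pearson_r(x, y):
    """Compute Pearson's r directly from the definitional formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of the products of deviations from the means
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square roots of the sums of squared deviations, multiplied together
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - y_bar) ** 2 for yi in y))
    return num / den

# Hypothetical data: hours studied vs. exam scores
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 75]
print(round(pearson_r(hours, scores), 4))  # close to +1: a strong positive linear relationship
```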

Covariance in the numerator

  • The numerator of the Pearson correlation formula is the sum of products of deviations from the means, which (divided by $n-1$) is the sample covariance between the two variables
  • Covariance measures how changes in one variable are associated with changes in another variable
  • Positive covariance indicates that the variables tend to increase or decrease together, while negative covariance suggests that as one variable increases, the other tends to decrease

Standard deviations in the denominator

  • The denominator of the Pearson correlation formula consists of the product of the standard deviations of the two variables
  • Standard deviation measures the dispersion of data points around the mean
  • Dividing the covariance by the product of the standard deviations standardizes it, making the correlation coefficient independent of the scale of the variables: equivalently, $r = \frac{\operatorname{cov}(x, y)}{s_x s_y}$
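
Because the same $n-1$ factors appear in the numerator and denominator of the definitional formula, $r$ equals the sample covariance divided by the product of the sample standard deviations. The sketch below, reusing the hypothetical data from the previous example, checks this equivalence with NumPy.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 75], dtype=float)

cov_xy = np.cov(x, y, ddof=1)[0, 1]        # sample covariance of x and y
s_x = np.std(x, ddof=1)                    # sample standard deviation of x
s_y = np.std(y, ddof=1)                    # sample standard deviation of y

r = cov_xy / (s_x * s_y)                   # unit-free, always between -1 and +1
print(round(r, 4))
print(round(np.corrcoef(x, y)[0, 1], 4))   # matches NumPy's built-in Pearson r
```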

Range of possible values

  • The Pearson correlation coefficient ranges from -1 to +1
  • A value of +1 indicates a perfect positive linear relationship, meaning that as one variable increases, the other variable increases proportionally
  • A value of -1 indicates a perfect negative linear relationship, meaning that as one variable increases, the other variable decreases proportionally
  • A value of 0 indicates no linear relationship between the variables

Positive vs negative correlation

  • Positive correlation occurs when an increase in one variable is associated with an increase in the other variable (height and weight)
  • Negative correlation occurs when an increase in one variable is associated with a decrease in the other variable (age and physical fitness)

Strength of correlation

  • The strength of the correlation is determined by the absolute value of the correlation coefficient
  • Values closer to 1 (either +1 or -1) indicate a stronger linear relationship
  • Values closer to 0 indicate a weaker linear relationship
  • As a general guideline, correlation coefficients with absolute values between 0.1 and 0.3 are considered weak, 0.3 to 0.5 moderate, and 0.5 to 1.0 strong

Hypothesis testing with Pearson correlation

  • Hypothesis testing allows researchers to determine whether the observed correlation in a sample is statistically significant and can be generalized to the population
  • The null hypothesis ($H_0$) states that there is no linear correlation between the variables in the population ($\rho = 0$)
  • The alternative hypothesis ($H_a$) states that there is a nonzero correlation between the variables in the population ($\rho \neq 0$)

Null vs alternative hypotheses

  • The null hypothesis assumes that any observed correlation in the sample is due to chance and does not reflect a true relationship in the population
  • The alternative hypothesis suggests that the observed correlation in the sample is unlikely to have occurred by chance and reflects a true relationship in the population

Test statistic and p-value

  • The test statistic for Pearson correlation is calculated from the sample correlation coefficient ($r$) and the sample size ($n$) as $t = r\sqrt{\frac{n-2}{1-r^2}}$
  • The test statistic follows a t-distribution with $n-2$ degrees of freedom
  • The p-value represents the probability of obtaining the observed correlation coefficient (or a more extreme value) if the null hypothesis is true
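
As a rough sketch of the test, the statistic $t = r\sqrt{\frac{n-2}{1-r^2}}$ can be computed by hand and compared against `scipy.stats.pearsonr`, which reports the same two-sided p-value; the data are again the hypothetical values used earlier.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 75], dtype=float)
n = len(x)

r = np.corrcoef(x, y)[0, 1]

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
print(round(t_stat, 3), round(p_value, 4))

# scipy performs the same test in one call
r_scipy, p_scipy = stats.pearsonr(x, y)
print(round(r_scipy, 4), round(p_scipy, 4))
```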

Significance level and decision rule

  • The significance level ($\alpha$) is the probability of rejecting the null hypothesis when it is true (Type I error)
  • Common significance levels are 0.05 and 0.01
  • If the p-value is less than the chosen significance level, the null hypothesis is rejected, and the correlation is considered statistically significant
  • If the p-value is greater than the significance level, the null hypothesis is not rejected, and the correlation is not considered statistically significant

Interpretation of Pearson correlation

  • Interpreting Pearson correlation involves considering both the strength and significance of the relationship
  • A strong correlation (close to -1 or +1) suggests a consistent linear relationship between the variables
  • A significant correlation (p-value < $\alpha$) indicates that the observed relationship is unlikely to have occurred by chance

Strength vs significance

  • Strength refers to the magnitude of the correlation coefficient and the degree to which the variables are linearly related
  • Significance refers to how unlikely the observed correlation would be to arise by chance if there were no true relationship in the population
  • A strong correlation may not always be statistically significant, especially with small sample sizes
  • A weak correlation may be statistically significant, particularly with large sample sizes

Correlation vs causation

  • Correlation does not imply causation
  • A significant correlation between two variables does not necessarily mean that one variable causes the other
  • Other factors, such as confounding variables or reverse causation, may explain the observed relationship
  • Additional research, such as controlled experiments, is needed to establish a causal relationship

Limitations of Pearson correlation

  • Pearson correlation has several limitations that should be considered when interpreting results
  • These limitations can affect the accuracy and generalizability of the findings

Sensitivity to outliers

  • Pearson correlation is sensitive to outliers, which are data points that are substantially different from the rest of the data
  • Outliers can have a disproportionate influence on the correlation coefficient, potentially leading to misleading results
  • It is essential to identify and address outliers before calculating Pearson correlation
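
A quick illustration of this sensitivity, using made-up data: eight points lie almost exactly on a line, and adding a single extreme point drags the coefficient far from +1.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = 2 * x + np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, -0.2])

print(round(np.corrcoef(x, y)[0, 1], 3))           # near +1 on the clean data

# Add one outlier far below the trend and recompute
x_out = np.append(x, 9.0)
y_out = np.append(y, -30.0)
print(round(np.corrcoef(x_out, y_out)[0, 1], 3))   # drops sharply (here it even turns negative)
```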

Assumption of linearity

  • Pearson correlation assumes a linear relationship between the variables
  • If the relationship is nonlinear (curvilinear), Pearson correlation may not accurately capture the true nature of the relationship
  • Scatterplots can help assess the linearity assumption visually

Inability to detect nonlinear relationships

  • Pearson correlation is not designed to detect nonlinear relationships between variables
  • Even if a strong nonlinear relationship exists, Pearson correlation may yield a low or non-significant coefficient
  • Other techniques, such as polynomial regression or nonlinear regression, may be more appropriate for examining nonlinear relationships
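
For instance, a perfectly deterministic U-shaped relationship produces a Pearson coefficient of essentially zero, as this small sketch shows.

```python
import numpy as np
from scipy import stats

x = np.linspace(-5, 5, 101)
y = x ** 2                       # exact, but nonlinear (U-shaped), relationship

r, p = stats.pearsonr(x, y)
print(round(r, 4), round(p, 4))  # r is ~0 and p is large: the linear measure misses the pattern
```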

Alternatives to Pearson correlation

  • When the assumptions of Pearson correlation are violated or the data is not continuous, alternative correlation measures can be used

Spearman rank correlation

  • Spearman rank correlation is a non-parametric measure that assesses the monotonic relationship between two variables
  • It is based on the ranks of the data points rather than their actual values
  • Spearman correlation is less sensitive to outliers and does not assume a linear relationship
  • It is suitable for ordinal data or when the relationship between variables is monotonic but not necessarily linear

Kendall's tau correlation

  • Kendall's tau correlation is another non-parametric measure that assesses the ordinal association between two variables
  • It is based on the number of concordant and discordant pairs in the data
  • Kendall's tau is less sensitive to outliers and does not assume a linear relationship
  • It is particularly useful for small sample sizes or when there are many tied ranks in the data
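
To contrast the three measures, the sketch below uses a monotonic but strongly curved (exponential) relationship; the data are purely illustrative. Pearson's $r$ falls noticeably below 1 because of the curvature, while Spearman and Kendall, which depend only on ranks, both report a perfect monotonic association.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21, dtype=float)
y = np.exp(x / 4.0)                          # monotonic but strongly nonlinear

print(round(stats.pearsonr(x, y)[0], 3))     # well below 1: curvature weakens the linear fit
print(round(stats.spearmanr(x, y)[0], 3))    # 1.0: the ranks increase perfectly together
print(round(stats.kendalltau(x, y)[0], 3))   # 1.0: every pair of points is concordant
```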

Applications of Pearson correlation

  • Pearson correlation is widely used in various fields, including social sciences, natural sciences, and business, to examine relationships between variables

Identifying linear relationships

  • Pearson correlation helps identify the presence, strength, and direction of linear relationships between two continuous variables
  • This information can be valuable for understanding how changes in one variable are associated with changes in another
  • Examples include examining the relationship between study time and exam scores or between income and life satisfaction

Validating research hypotheses

  • Researchers often use Pearson correlation to test hypotheses about the relationship between variables
  • A significant correlation can provide support for a hypothesized relationship
  • For example, a researcher may hypothesize that there is a positive correlation between job satisfaction and employee productivity

Informing further analyses

  • Pearson correlation can be used as a preliminary step to inform subsequent analyses
  • A strong correlation between variables may suggest that they are suitable for inclusion in a multiple regression model
  • Conversely, a weak or non-significant correlation may indicate that the variables are not closely related and may not contribute significantly to a predictive model
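
As one illustration of this screening step, a pairwise Pearson correlation matrix can be computed before fitting a regression model; the variable names and simulated data below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
study_hours = rng.normal(5, 1.5, n)
sleep_hours = rng.normal(7, 1.0, n)
exam_score = 8 * study_hours + 0.5 * sleep_hours + rng.normal(0, 5, n)

df = pd.DataFrame({"study_hours": study_hours,
                   "sleep_hours": sleep_hours,
                   "exam_score": exam_score})

# Pairwise Pearson correlations as a screening step before building a regression model:
# study_hours correlates strongly with exam_score, sleep_hours only weakly
print(df.corr(method="pearson").round(2))
```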