Probability and Statistics Unit 10 Review

10.1 Pearson correlation coefficient

Written by the Fiveable Content Team • Last updated September 2025

The Pearson correlation coefficient measures the strength and direction of linear relationships between two continuous variables. It ranges from -1 to +1, with values closer to the extremes indicating stronger relationships. This statistical tool helps researchers quantify and interpret connections between variables.

Using Pearson correlation involves several assumptions, including linearity and approximately normal distributions. The coefficient is computed from the covariance of the two variables and their standard deviations. Hypothesis testing determines whether an observed correlation is statistically significant, helping researchers draw meaningful conclusions from data analysis.

Definition of Pearson correlation coefficient

  • Pearson correlation coefficient measures the linear relationship between two continuous variables
  • Denoted by the symbol $r$, it ranges from -1 to +1
  • Values closer to -1 or +1 indicate a stronger linear relationship, while values closer to 0 suggest a weaker or no linear relationship
  • Positive $r$ values indicate a direct relationship (as one variable increases, the other also increases), while negative $r$ values indicate an inverse relationship (as one variable increases, the other decreases)

Assumptions for using Pearson correlation

  • Both variables must be continuous and measured on an interval or ratio scale
  • The relationship between the variables should be linear
  • There should be no significant outliers in the data
  • The variables should be approximately normally distributed
  • Homoscedasticity should hold: the variability in one variable should be similar across all values of the other variable

Formula for calculating Pearson correlation

  • The Pearson correlation coefficient is calculated using the following formula: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
  • $x_i$ and $y_i$ represent individual data points, $\bar{x}$ and $\bar{y}$ represent the means of the respective variables, and $n$ is the number of data points
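
To make the formula concrete, here is a minimal sketch in Python that computes $r$ directly from the definition above; the study-hours and exam-score numbers are invented purely for illustration.

```python
import math

def pearson_r(x, y):
    """Compute Pearson's r directly from the definitional formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of the products of deviations from the means
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square roots of the sums of squared deviations, multiplied together
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - y_bar) ** 2 for yi in y))
    return num / den

# Hypothetical data: hours studied vs. exam scores
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 75]
print(round(pearson_r(hours, scores), 4))  # close to +1: a strong positive linear relationship
```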

Covariance in the numerator

  • The numerator of the Pearson correlation formula is the sum of products of deviations from the means, which (divided by $n-1$) is the sample covariance between the two variables
  • Covariance measures how changes in one variable are associated with changes in another variable
  • Positive covariance indicates that the variables tend to increase or decrease together, while negative covariance suggests that as one variable increases, the other tends to decrease

Standard deviations in the denominator

  • The denominator of the Pearson correlation formula consists of the product of the standard deviations of the two variables
  • Standard deviation measures the dispersion of data points around the mean
  • Dividing the covariance by the product of the standard deviations standardizes it, making the correlation coefficient independent of the scale of the variables: equivalently, $r = \frac{\operatorname{cov}(x, y)}{s_x s_y}$
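
Because the same $n-1$ factors appear in the numerator and denominator of the definitional formula, $r$ equals the sample covariance divided by the product of the sample standard deviations. The sketch below, reusing the hypothetical data from the previous example, checks this equivalence with NumPy.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 75], dtype=float)

cov_xy = np.cov(x, y, ddof=1)[0, 1]        # sample covariance of x and y
s_x = np.std(x, ddof=1)                    # sample standard deviation of x
s_y = np.std(y, ddof=1)                    # sample standard deviation of y

r = cov_xy / (s_x * s_y)                   # unit-free, always between -1 and +1
print(round(r, 4))
print(round(np.corrcoef(x, y)[0, 1], 4))   # matches NumPy's built-in Pearson r
```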

Range of possible values

  • The Pearson correlation coefficient ranges from -1 to +1
  • A value of +1 indicates a perfect positive linear relationship, meaning that as one variable increases, the other variable increases proportionally
  • A value of -1 indicates a perfect negative linear relationship, meaning that as one variable increases, the other variable decreases proportionally
  • A value of 0 indicates no linear relationship between the variables

Positive vs negative correlation

  • Positive correlation occurs when an increase in one variable is associated with an increase in the other variable (height and weight)
  • Negative correlation occurs when an increase in one variable is associated with a decrease in the other variable (age and physical fitness)

Strength of correlation

  • The strength of the correlation is determined by the absolute value of the correlation coefficient
  • Values closer to 1 (either +1 or -1) indicate a stronger linear relationship
  • Values closer to 0 indicate a weaker linear relationship
  • As a general guideline, correlation coefficients with absolute values between 0.1 and 0.3 are considered weak, 0.3 to 0.5 moderate, and 0.5 to 1.0 strong

Hypothesis testing with Pearson correlation

  • Hypothesis testing allows researchers to determine whether the observed correlation in a sample is statistically significant and can be generalized to the population
  • The null hypothesis ($H_0$) states that there is no linear correlation between the variables in the population ($\rho = 0$)
  • The alternative hypothesis ($H_a$) states that there is a nonzero correlation between the variables in the population ($\rho \neq 0$)

Null vs alternative hypotheses

  • The null hypothesis assumes that any observed correlation in the sample is due to chance and does not reflect a true relationship in the population
  • The alternative hypothesis suggests that the observed correlation in the sample is unlikely to have occurred by chance and reflects a true relationship in the population

Test statistic and p-value

  • The test statistic for Pearson correlation is calculated from the sample correlation coefficient ($r$) and the sample size ($n$) as $t = r\sqrt{\frac{n-2}{1-r^2}}$
  • The test statistic follows a t-distribution with $n-2$ degrees of freedom
  • The p-value represents the probability of obtaining the observed correlation coefficient (or a more extreme value) if the null hypothesis is true
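
As a rough sketch of the test, the statistic $t = r\sqrt{\frac{n-2}{1-r^2}}$ can be computed by hand and compared against `scipy.stats.pearsonr`, which reports the same two-sided p-value; the data are again the hypothetical values used earlier.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 75], dtype=float)
n = len(x)

r = np.corrcoef(x, y)[0, 1]

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
print(round(t_stat, 3), round(p_value, 4))

# scipy performs the same test in one call
r_scipy, p_scipy = stats.pearsonr(x, y)
print(round(r_scipy, 4), round(p_scipy, 4))
```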

Significance level and decision rule

  • The significance level ($\alpha$) is the probability of rejecting the null hypothesis when it is true (Type I error)
  • Common significance levels are 0.05 and 0.01
  • If the p-value is less than the chosen significance level, the null hypothesis is rejected, and the correlation is considered statistically significant
  • If the p-value is greater than the significance level, the null hypothesis is not rejected, and the correlation is not considered statistically significant

Interpretation of Pearson correlation

  • Interpreting Pearson correlation involves considering both the strength and significance of the relationship
  • A strong correlation (close to -1 or +1) suggests a consistent linear relationship between the variables
  • A significant correlation (p-value < $\alpha$) indicates that the observed relationship is unlikely to have occurred by chance

Strength vs significance

  • Strength refers to the magnitude of the correlation coefficient and the degree to which the variables are linearly related
  • Significance refers to how unlikely the observed correlation would be to arise by chance if there were no true relationship in the population
  • A strong correlation may not always be statistically significant, especially with small sample sizes
  • A weak correlation may be statistically significant, particularly with large sample sizes

Correlation vs causation

  • Correlation does not imply causation
  • A significant correlation between two variables does not necessarily mean that one variable causes the other
  • Other factors, such as confounding variables or reverse causation, may explain the observed relationship
  • Additional research, such as controlled experiments, is needed to establish a causal relationship

Limitations of Pearson correlation

  • Pearson correlation has several limitations that should be considered when interpreting results
  • These limitations can affect the accuracy and generalizability of the findings

Sensitivity to outliers

  • Pearson correlation is sensitive to outliers, which are data points that are substantially different from the rest of the data
  • Outliers can have a disproportionate influence on the correlation coefficient, potentially leading to misleading results
  • It is essential to identify and address outliers before calculating Pearson correlation
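
A quick illustration of this sensitivity, using made-up data: eight points lie almost exactly on a line, and adding a single extreme point drags the coefficient far from +1.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = 2 * x + np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, -0.2])

print(round(np.corrcoef(x, y)[0, 1], 3))           # near +1 on the clean data

# Add one outlier far below the trend and recompute
x_out = np.append(x, 9.0)
y_out = np.append(y, -30.0)
print(round(np.corrcoef(x_out, y_out)[0, 1], 3))   # drops sharply (here it even turns negative)
```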

Assumption of linearity

  • Pearson correlation assumes a linear relationship between the variables
  • If the relationship is nonlinear (curvilinear), Pearson correlation may not accurately capture the true nature of the relationship
  • Scatterplots can help assess the linearity assumption visually

Inability to detect nonlinear relationships

  • Pearson correlation is not designed to detect nonlinear relationships between variables
  • Even if a strong nonlinear relationship exists, Pearson correlation may yield a low or non-significant coefficient
  • Other techniques, such as polynomial regression or nonlinear regression, may be more appropriate for examining nonlinear relationships
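
For instance, a perfectly deterministic U-shaped relationship produces a Pearson coefficient of essentially zero, as this small sketch shows.

```python
import numpy as np
from scipy import stats

x = np.linspace(-5, 5, 101)
y = x ** 2                       # exact, but nonlinear (U-shaped), relationship

r, p = stats.pearsonr(x, y)
print(round(r, 4), round(p, 4))  # r is ~0 and p is large: the linear measure misses the pattern
```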

Alternatives to Pearson correlation

  • When the assumptions of Pearson correlation are violated or the data is not continuous, alternative correlation measures can be used

Spearman rank correlation

  • Spearman rank correlation is a non-parametric measure that assesses the monotonic relationship between two variables
  • It is based on the ranks of the data points rather than their actual values
  • Spearman correlation is less sensitive to outliers and does not assume a linear relationship
  • It is suitable for ordinal data or when the relationship between variables is monotonic but not necessarily linear

Kendall's tau correlation

  • Kendall's tau correlation is another non-parametric measure that assesses the ordinal association between two variables
  • It is based on the number of concordant and discordant pairs in the data
  • Kendall's tau is less sensitive to outliers and does not assume a linear relationship
  • It is particularly useful for small sample sizes or when there are many tied ranks in the data
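
To contrast the three measures, the sketch below uses a monotonic but strongly curved (exponential) relationship; the data are purely illustrative. Pearson's $r$ falls noticeably below 1 because of the curvature, while Spearman and Kendall, which depend only on ranks, both report a perfect monotonic association.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21, dtype=float)
y = np.exp(x / 4.0)                          # monotonic but strongly nonlinear

print(round(stats.pearsonr(x, y)[0], 3))     # well below 1: curvature weakens the linear fit
print(round(stats.spearmanr(x, y)[0], 3))    # 1.0: the ranks increase perfectly together
print(round(stats.kendalltau(x, y)[0], 3))   # 1.0: every pair of points is concordant
```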

Applications of Pearson correlation

  • Pearson correlation is widely used in various fields, including social sciences, natural sciences, and business, to examine relationships between variables

Identifying linear relationships

  • Pearson correlation helps identify the presence, strength, and direction of linear relationships between two continuous variables
  • This information can be valuable for understanding how changes in one variable are associated with changes in another
  • Examples include examining the relationship between study time and exam scores or between income and life satisfaction

Validating research hypotheses

  • Researchers often use Pearson correlation to test hypotheses about the relationship between variables
  • A significant correlation can provide support for a hypothesized relationship
  • For example, a researcher may hypothesize that there is a positive correlation between job satisfaction and employee productivity

Informing further analyses

  • Pearson correlation can be used as a preliminary step to inform subsequent analyses
  • A strong correlation between variables may suggest that they are suitable for inclusion in a multiple regression model
  • Conversely, a weak or non-significant correlation may indicate that the variables are not closely related and may not contribute significantly to a predictive model
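
As one illustration of this screening step, a pairwise Pearson correlation matrix can be computed before fitting a regression model; the variable names and simulated data below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
study_hours = rng.normal(5, 1.5, n)
sleep_hours = rng.normal(7, 1.0, n)
exam_score = 8 * study_hours + 0.5 * sleep_hours + rng.normal(0, 5, n)

df = pd.DataFrame({"study_hours": study_hours,
                   "sleep_hours": sleep_hours,
                   "exam_score": exam_score})

# Pairwise Pearson correlations as a screening step before building a regression model:
# study_hours correlates strongly with exam_score, sleep_hours only weakly
print(df.corr(method="pearson").round(2))
```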