Statistical Inference Unit 3 Review

3.3 Covariance and Correlation

Written by the Fiveable Content Team • Last updated September 2025
Covariance and correlation are key tools for understanding relationships between variables. Covariance measures joint variability, while correlation standardizes this measure for easier interpretation. Both help quantify linear associations in data.

These measures have important applications but also limitations. They only capture linear relationships, can be affected by outliers, and don't imply causation. Understanding these nuances is crucial for proper statistical analysis and interpretation.

Measures of Association

Covariance calculation and interpretation

  • Covariance quantifies the joint variability of two random variables and indicates the direction of their linear relationship
  • Calculate using the formula $Cov(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]$ or the equivalent shortcut $Cov(X,Y) = E[XY] - E[X]E[Y]$ (a sketch follows this list)
  • Positive covariance suggests variables move in same direction (stock prices and company profits)
  • Negative covariance indicates variables move oppositely (temperature and heating costs)
  • Zero covariance points to no linear relationship (shoe size and test scores)
  • Units expressed as product of the two variables' units (m·kg for height and weight)
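
A minimal NumPy sketch, using made-up height/weight numbers, showing that the definition form and the shortcut form give the same value; `np.cov` with `ddof=0` computes the same population covariance:

```python
import numpy as np

# Made-up paired data: heights (m) and weights (kg)
x = np.array([1.60, 1.68, 1.75, 1.82, 1.90])
y = np.array([55.0, 62.0, 70.0, 78.0, 88.0])

# Definition form: Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))

# Shortcut form: Cov(X, Y) = E[XY] - E[X]E[Y]
cov_short = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_short)          # identical values, in m·kg
print(np.cov(x, y, ddof=0)[0, 1])  # NumPy's population covariance agrees
```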

Properties of covariance

  • Symmetry property states $Cov(X,Y) = Cov(Y,X)$
  • Linearity allows $Cov(aX + b, Y) = aCov(X,Y)$ for constants a and b
  • Variance emerges as special case where $Cov(X,X) = Var(X)$
  • Independent variables yield zero covariance, but the reverse is not always true (uncorrelated does not imply independent)
  • Covariance matrix summarizes pairwise relationships among multiple variables (gene expression data); see the sketch after this list
  • Scale dependence limits interpretability: rescaling either variable rescales the covariance, which motivates the standardized correlation coefficient below
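
A quick numerical check of these properties on simulated Gaussian data (the variables and the helper `cov` are illustrative, not part of any library):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)  # y partly driven by x

def cov(a, b):
    """Population covariance E[(A - mu_A)(B - mu_B)]."""
    return np.mean((a - a.mean()) * (b - b.mean()))

# Symmetry: Cov(X, Y) = Cov(Y, X)
assert np.isclose(cov(x, y), cov(y, x))

# Linearity: Cov(aX + b, Y) = a * Cov(X, Y)
a, b = 3.0, 7.0
assert np.isclose(cov(a * x + b, y), a * cov(x, y))

# Special case: Cov(X, X) = Var(X)
assert np.isclose(cov(x, x), np.var(x))

# Covariance matrix summarizing several variables at once
data = np.vstack([x, y, x + y])
print(np.cov(data, ddof=0))  # symmetric 3x3 matrix; diagonal holds variances
```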

Correlation coefficient basics

  • Correlation standardizes covariance, offering a scale-independent measure of linear relationship
  • Pearson's correlation coefficient computed as $\rho_{X,Y} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}$ (a sketch follows this list)
  • Values range from -1 to 1, indicating strength and direction of the linear relationship
  • Perfect positive correlation (1) means an exact increasing linear relationship; values near 1 show variables rising together (height and weight)
  • Perfect negative correlation (-1) means an exact decreasing linear relationship; values near -1 indicate a strong inverse association (temperature and heating bill)
  • Zero correlation suggests no linear relationship (shoe size and intelligence)
  • Sample correlation estimates population parameter from observed data
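
A sketch of the formula on simulated data (slope and noise level chosen arbitrarily); the manual computation matches NumPy's `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)  # linear signal plus noise

# rho = Cov(X, Y) / (sigma_X * sigma_Y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / (x.std() * y.std())

print(r)                        # about 2/sqrt(5) ~= 0.894 for this setup
print(np.corrcoef(x, y)[0, 1])  # library result agrees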

Limitations of correlation

  • Only captures linear relationships, missing more complex patterns (a sine wave can yield near-zero correlation despite strong dependence)
  • Outliers can significantly distort correlation value
  • Fails to detect non-monotonic relationships (U-shaped curves)
  • Correlation ≠ causation (ice cream sales and crime rates)
  • Spurious correlations arise from coincidental data patterns (stork populations and birth rates)
  • Alternative measures include Spearman's rank correlation and Kendall's tau for monotonic but non-linear relationships (see the sketch after this list)
  • Range restriction can artificially lower correlation (GPAs computed only among already-admitted students)
  • Measurement error introduces noise reducing observed correlation
  • Scatterplots are crucial for visualizing the relationship before interpreting a correlation coefficient
  • Ecological fallacy warns against applying group-level correlations to individuals (country wealth vs individual income)
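
A short sketch of two of these points on simulated data (all numbers illustrative, assuming SciPy is available): Pearson's r is blind to a U-shaped relationship, while Spearman's rank correlation fully captures a monotonic but non-linear one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=2_000)

# U-shaped (non-monotonic) dependence: strong, but invisible to Pearson
y_quad = x**2 + rng.normal(scale=0.1, size=x.size)
r_quad, _ = stats.pearsonr(x, y_quad)
print(r_quad)  # near 0 despite a near-deterministic relationship

# Monotonic but non-linear dependence: Spearman's rank correlation sees it
y_mono = np.exp(x)
r_pearson, _ = stats.pearsonr(x, y_mono)
rho_spearman, _ = stats.spearmanr(x, y_mono)
print(r_pearson)     # noticeably below 1
print(rho_spearman)  # 1.0: the ranks match perfectly
```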