🧰 Engineering Applications of Statistics
Unit 12 Review

12.1 Principal component analysis (PCA)

Written by the Fiveable Content Team • Last updated September 2025

PCA transforms correlated variables into uncorrelated principal components, capturing maximum variance with fewer dimensions. It's a powerful tool for data compression, noise reduction, and visualizing high-dimensional data in lower-dimensional spaces.

In multivariate analysis, PCA helps identify patterns in complex datasets. It's widely used in fields like image processing, bioinformatics, and finance, making it easier to analyze and interpret large, multidimensional datasets.

Data Dimensionality Reduction with PCA

Principal Component Analysis (PCA) Basics

  • Principal Component Analysis (PCA) transforms a set of correlated variables into a new set of uncorrelated variables called principal components
  • PCA captures the maximum variance in the data with as few principal components as possible, reducing the dimensionality of the dataset while retaining the most important information
  • The first principal component captures the largest amount of variance in the data, and each subsequent component captures the largest share of the remaining variance while staying orthogonal to the components before it
  • PCA is useful for data compression (reducing storage requirements), noise reduction (removing irrelevant information), and visualization of high-dimensional data in lower-dimensional spaces (2D or 3D plots), as in the sketch following this list
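
As a rough end-to-end illustration of these ideas, here is a minimal sketch that reduces the four iris measurements to two uncorrelated principal components; the use of scikit-learn, the iris dataset, and the two-component choice are illustrative assumptions rather than part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Small 4-dimensional example dataset: 150 samples, 4 correlated measurements
X, y = load_iris(return_X_y=True)

# Standardize so all variables contribute equally
X_std = StandardScaler().fit_transform(X)

# Reduce 4 correlated variables to 2 uncorrelated principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

print("original shape:", X.shape)                 # (150, 4)
print("reduced shape:", scores.shape)             # (150, 2)
print("variance explained:", pca.explained_variance_ratio_)
```

The two scores per sample can then be plotted directly in 2D, which is the visualization use case mentioned above.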

Benefits and Applications of PCA

  • PCA helps identify patterns and relationships in high-dimensional data by projecting it onto a lower-dimensional space
  • It can be used as a preprocessing step for other machine learning algorithms, improving their performance and reducing computational complexity (see the pipeline sketch after this list)
  • PCA is widely applied in fields such as image processing (face recognition), bioinformatics (gene expression analysis), and finance (stock market analysis)
  • By reducing the number of variables, PCA can help mitigate the curse of dimensionality, where the number of samples required for accurate modeling grows exponentially with the number of features
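
To show what PCA as a preprocessing step might look like in practice, the sketch below standardizes the data, keeps enough components to cover roughly 95% of the variance, and feeds the reduced features to a classifier. The breast cancer dataset, the 95% target, and logistic regression are arbitrary choices for illustration, not a recommended recipe.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)        # 30 correlated features

# PCA as a preprocessing step: standardize, keep enough components for
# ~95% of the variance, then fit a classifier on the reduced features
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),               # fraction => variance target
    ("clf", LogisticRegression(max_iter=5000)),
])

acc = cross_val_score(pipe, X, y, cv=5).mean()
print("5-fold accuracy with PCA preprocessing: %.3f" % acc)
```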

Transforming Correlated Variables with PCA

Data Standardization and Covariance Matrix

  • The first step in PCA is to standardize the data by subtracting the mean and dividing by the standard deviation for each variable, so that every variable has zero mean and unit variance (a worked sketch follows this list)
  • Standardization ensures that variables with larger scales do not dominate the analysis and that all variables contribute equally to the principal components
  • The covariance matrix or correlation matrix of the standardized data is computed to measure the relationships between the variables
  • The covariance matrix captures the pairwise covariances between variables, while the correlation matrix captures the pairwise correlations (normalized covariances)
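
A minimal sketch of these two steps, using NumPy on synthetic data (the data-generating process and sample size are made up purely for illustration): it standardizes each variable and then confirms that the covariance matrix of the standardized data matches the correlation matrix of the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 200 samples of 4 correlated variables
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.2 * rng.normal(size=(200, 4))

# Standardize: subtract each column mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))   # zero mean, unit variance

# Covariance matrix of the standardized data (ddof=0 to match the std above) ...
cov_Z = np.cov(Z, rowvar=False, ddof=0)

# ... equals the correlation matrix of the original variables
corr_X = np.corrcoef(X, rowvar=False)
print(np.allclose(cov_Z, corr_X))                         # True
```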

Eigendecomposition and Principal Component Selection

  • Eigenvalues and eigenvectors of the covariance or correlation matrix are calculated, where eigenvectors represent the directions of the principal components, and eigenvalues represent the amount of variance explained by each component
  • The eigenvectors are orthogonal (perpendicular) to each other, ensuring that the principal components are uncorrelated
  • The eigenvectors are sorted in descending order based on their corresponding eigenvalues, and the top k eigenvectors are selected to form the principal components
  • The original data is projected onto the selected principal components to obtain the transformed dataset with reduced dimensionality, as in the sketch after this list
  • The transformed data points, called principal component scores, represent the original data in the new coordinate system defined by the principal components
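
Continuing in the same spirit, this sketch performs the eigendecomposition, sorts components by eigenvalue, projects the standardized data onto the top k eigenvectors, and checks that the resulting scores are uncorrelated; the synthetic data and the choice k = 2 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: 300 samples of 5 correlated variables, then standardized
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(300, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the covariance matrix of the standardized data
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))

# eigh returns ascending order; sort descending by explained variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top k eigenvectors and project the data onto them
k = 2
scores = Z @ eigvecs[:, :k]                 # principal component scores

# Scores are uncorrelated: their covariance matrix is (nearly) diagonal
print(np.round(np.cov(scores, rowvar=False), 3))
print("variance explained by first %d PCs: %.2f" % (k, eigvals[:k].sum() / eigvals.sum()))
```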

Interpreting Principal Components and Variance

Principal Component Loadings and Interpretation

  • Each principal component is a linear combination of the original variables, with coefficients called loadings that indicate the importance of each variable in the component
  • The loadings can be interpreted to understand the relationships between the original variables and the principal components, helping to identify the underlying structure of the data
  • Variables with high absolute loadings (positive or negative) on a principal component have a strong influence on that component
  • Variables with large loadings of the same sign on a principal component tend to be positively correlated, while variables with large loadings of opposite signs tend to be negatively correlated (the loading table sketch below illustrates this)
  • Domain knowledge is crucial for interpreting the principal components and assigning meaningful labels to them
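
One way to inspect loadings is to tabulate them against the original variable names, as in the sketch below using scikit-learn and pandas on the iris data. Note the assumption that the rows of `components_` are the loadings in the sense used here (coefficients of the linear combination); some texts instead scale them by the square root of the eigenvalue.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_iris()
X_std = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2).fit(X_std)

# Each row of components_ holds the coefficients of one principal component:
# the linear combination of the original variables that defines it
loadings = pd.DataFrame(
    pca.components_.T,
    index=data.feature_names,
    columns=["PC1", "PC2"],
)
print(loadings.round(2))
# Variables with large same-sign loadings on a PC move together along it;
# opposite signs indicate an inverse relationship on that component.
```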

Variance Explained and Scree Plots

  • The eigenvalues associated with each principal component represent the amount of variance explained by that component, and their sum equals the total variance in the data
  • The proportion of variance explained by each principal component can be calculated by dividing its eigenvalue by the sum of all eigenvalues, providing insight into the relative importance of each component
  • Scree plots, which display the eigenvalues in descending order, can be used to visually assess the contribution of each principal component to the total variance and help determine how many components to retain (a plotting sketch follows this list)
  • The scree plot typically shows a sharp drop in eigenvalues, followed by a leveling off, and the "elbow" point indicates the number of components that capture the majority of the variance in the data
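
The sketch below computes the proportion and cumulative proportion of variance explained and draws a scree plot with matplotlib; the dataset and plotting details are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))

eigvals = pca.explained_variance_           # eigenvalues, descending order
prop = eigvals / eigvals.sum()              # proportion of variance per PC
cum = np.cumsum(prop)                       # cumulative proportion

print("cumulative variance of first 3 PCs: %.2f" % cum[2])

# Scree plot: eigenvalues in descending order; look for the "elbow"
fig, ax = plt.subplots()
ax.plot(range(1, len(eigvals) + 1), eigvals, "o-")
ax.set_xlabel("Principal component")
ax.set_ylabel("Eigenvalue")
ax.set_title("Scree plot")
plt.show()
```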

Selecting Optimal Principal Components

Criteria for Retaining Principal Components

  • Retaining too few principal components may result in loss of important information, while retaining too many components may include noise and hinder interpretation
  • The cumulative proportion of variance explained by the selected principal components is a key criterion in determining the optimal number of components to retain
  • A common rule of thumb is to retain principal components that cumulatively explain a certain percentage (e.g., 70% or 80%) of the total variance in the data
  • The elbow method involves plotting the eigenvalues or the cumulative proportion of variance explained against the number of components and identifying the "elbow" point where the curve starts to level off, indicating the optimal number of components
  • The Kaiser-Guttman criterion suggests retaining components with eigenvalues greater than 1 when PCA is performed on the correlation matrix (standardized variables), since such components explain more variance than a single original variable; both rules are illustrated in the sketch after this list
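
A short sketch of both rules of thumb, assuming standardized data; the breast cancer dataset and the 80% variance target are chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))

eigvals = pca.explained_variance_               # one eigenvalue per component
cum = np.cumsum(eigvals) / eigvals.sum()        # cumulative variance explained

# Rule of thumb: keep enough components to explain 80% of the total variance
k_variance = int(np.searchsorted(cum, 0.80) + 1)

# Kaiser-Guttman: keep components whose eigenvalue exceeds 1
# (appropriate here because the variables were standardized)
k_kaiser = int((eigvals > 1).sum())

print("components for 80% variance:", k_variance)
print("components with eigenvalue > 1:", k_kaiser)
```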

Cross-Validation and Interpretability

  • Cross-validation techniques, such as k-fold cross-validation, can be used to assess PCA with different numbers of retained components and to select the number that performs best on a chosen evaluation metric, such as reconstruction error or classification accuracy (see the sketch after this list)
  • The interpretability and practicality of the retained principal components should also be considered, ensuring that the selected components align with the domain knowledge and research objectives
  • In some cases, retaining a smaller number of easily interpretable components may be preferred over a larger number of components that explain slightly more variance but are harder to interpret
  • The choice of the optimal number of principal components depends on the specific dataset, research goals, and the trade-off between dimensionality reduction and information retention
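
Finally, a sketch of k-fold cross-validation based on reconstruction error: for each candidate number of components, PCA is fit on the training folds and the mean squared reconstruction error is measured on the held-out fold. The dataset, fold count, and candidate values of k are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# For each candidate k, fit PCA on the training folds and measure how well
# it reconstructs the held-out fold (mean squared reconstruction error)
for k in (1, 2, 5, 10):
    errors = []
    for train_idx, test_idx in kf.split(X):
        scaler = StandardScaler().fit(X[train_idx])
        Z_train = scaler.transform(X[train_idx])
        Z_test = scaler.transform(X[test_idx])

        pca = PCA(n_components=k).fit(Z_train)
        Z_hat = pca.inverse_transform(pca.transform(Z_test))
        errors.append(np.mean((Z_test - Z_hat) ** 2))

    print("k=%2d  held-out reconstruction MSE: %.3f" % (k, np.mean(errors)))
```

Lower held-out error with more components reflects better information retention, so the final choice still involves the trade-off against interpretability described above.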