🧰 Engineering Applications of Statistics
Unit 12 Review

12.1 Principal component analysis (PCA)

Written by the Fiveable Content Team • Last updated September 2025

PCA transforms correlated variables into uncorrelated principal components, capturing maximum variance with fewer dimensions. It's a powerful tool for data compression, noise reduction, and visualizing high-dimensional data in lower-dimensional spaces.

In multivariate analysis, PCA helps identify patterns in complex datasets. It's widely used in fields like image processing, bioinformatics, and finance, making it easier to analyze and interpret large, multidimensional datasets.

Data Dimensionality Reduction with PCA

Principal Component Analysis (PCA) Basics

  • Principal Component Analysis (PCA) transforms a set of correlated variables into a new set of uncorrelated variables called principal components
  • PCA captures the maximum variance in the data with as few principal components as possible, reducing the dimensionality of the dataset while retaining the most important information
  • The first principal component captures the largest amount of variance in the data, and each subsequent component captures the largest share of the remaining variance while staying orthogonal to the components before it
  • PCA is useful for data compression (reducing storage requirements), noise reduction (removing irrelevant information), and visualization of high-dimensional data in lower-dimensional spaces (2D or 3D plots), as in the sketch following this list
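
As a rough end-to-end illustration of these ideas, here is a minimal sketch that reduces the four iris measurements to two uncorrelated principal components; the use of scikit-learn, the iris dataset, and the two-component choice are illustrative assumptions rather than part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Small 4-dimensional example dataset: 150 samples, 4 correlated measurements
X, y = load_iris(return_X_y=True)

# Standardize so all variables contribute equally
X_std = StandardScaler().fit_transform(X)

# Reduce 4 correlated variables to 2 uncorrelated principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

print("original shape:", X.shape)                 # (150, 4)
print("reduced shape:", scores.shape)             # (150, 2)
print("variance explained:", pca.explained_variance_ratio_)
```

The two scores per sample can then be plotted directly in 2D, which is the visualization use case mentioned above.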

Benefits and Applications of PCA

  • PCA helps identify patterns and relationships in high-dimensional data by projecting it onto a lower-dimensional space
  • It can be used as a preprocessing step for other machine learning algorithms, improving their performance and reducing computational complexity (see the pipeline sketch after this list)
  • PCA is widely applied in fields such as image processing (face recognition), bioinformatics (gene expression analysis), and finance (stock market analysis)
  • By reducing the number of variables, PCA can help mitigate the curse of dimensionality, where the number of samples required for accurate modeling grows exponentially with the number of features
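
To show what PCA as a preprocessing step might look like in practice, the sketch below standardizes the data, keeps enough components to cover roughly 95% of the variance, and feeds the reduced features to a classifier. The breast cancer dataset, the 95% target, and logistic regression are arbitrary choices for illustration, not a recommended recipe.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)        # 30 correlated features

# PCA as a preprocessing step: standardize, keep enough components for
# ~95% of the variance, then fit a classifier on the reduced features
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),               # fraction => variance target
    ("clf", LogisticRegression(max_iter=5000)),
])

acc = cross_val_score(pipe, X, y, cv=5).mean()
print("5-fold accuracy with PCA preprocessing: %.3f" % acc)
```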

Transforming Correlated Variables with PCA

Data Standardization and Covariance Matrix

  • The first step in PCA is to standardize the data by subtracting the mean and dividing by the standard deviation for each variable, so that every variable has zero mean and unit variance (a worked sketch follows this list)
  • Standardization ensures that variables with larger scales do not dominate the analysis and that all variables contribute equally to the principal components
  • The covariance matrix or correlation matrix of the standardized data is computed to measure the relationships between the variables
  • The covariance matrix captures the pairwise covariances between variables, while the correlation matrix captures the pairwise correlations (normalized covariances)
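
A minimal sketch of these two steps, using NumPy on synthetic data (the data-generating process and sample size are made up purely for illustration): it standardizes each variable and then confirms that the covariance matrix of the standardized data matches the correlation matrix of the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 200 samples of 4 correlated variables
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.2 * rng.normal(size=(200, 4))

# Standardize: subtract each column mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))   # zero mean, unit variance

# Covariance matrix of the standardized data (ddof=0 to match the std above) ...
cov_Z = np.cov(Z, rowvar=False, ddof=0)

# ... equals the correlation matrix of the original variables
corr_X = np.corrcoef(X, rowvar=False)
print(np.allclose(cov_Z, corr_X))                         # True
```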

Eigendecomposition and Principal Component Selection

  • Eigenvalues and eigenvectors of the covariance or correlation matrix are calculated, where eigenvectors represent the directions of the principal components, and eigenvalues represent the amount of variance explained by each component
  • The eigenvectors are orthogonal (perpendicular) to each other, ensuring that the principal components are uncorrelated
  • The eigenvectors are sorted in descending order based on their corresponding eigenvalues, and the top k eigenvectors are selected to form the principal components
  • The original data is projected onto the selected principal components to obtain the transformed dataset with reduced dimensionality, as in the sketch after this list
  • The transformed data points, called principal component scores, represent the original data in the new coordinate system defined by the principal components
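
Continuing in the same spirit, this sketch performs the eigendecomposition, sorts components by eigenvalue, projects the standardized data onto the top k eigenvectors, and checks that the resulting scores are uncorrelated; the synthetic data and the choice k = 2 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: 300 samples of 5 correlated variables, then standardized
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(300, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the covariance matrix of the standardized data
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))

# eigh returns ascending order; sort descending by explained variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top k eigenvectors and project the data onto them
k = 2
scores = Z @ eigvecs[:, :k]                 # principal component scores

# Scores are uncorrelated: their covariance matrix is (nearly) diagonal
print(np.round(np.cov(scores, rowvar=False), 3))
print("variance explained by first %d PCs: %.2f" % (k, eigvals[:k].sum() / eigvals.sum()))
```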

Interpreting Principal Components and Variance

Principal Component Loadings and Interpretation

  • Each principal component is a linear combination of the original variables, with coefficients called loadings that indicate the importance of each variable in the component
  • The loadings can be interpreted to understand the relationships between the original variables and the principal components, helping to identify the underlying structure of the data
  • Variables with high absolute loadings (positive or negative) on a principal component have a strong influence on that component
  • Variables with large loadings of the same sign on a principal component tend to be positively correlated, while variables with large loadings of opposite signs tend to be negatively correlated (the loading table sketch below illustrates this)
  • Domain knowledge is crucial for interpreting the principal components and assigning meaningful labels to them
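
One way to inspect loadings is to tabulate them against the original variable names, as in the sketch below using scikit-learn and pandas on the iris data. Note the assumption that the rows of `components_` are the loadings in the sense used here (coefficients of the linear combination); some texts instead scale them by the square root of the eigenvalue.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_iris()
X_std = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2).fit(X_std)

# Each row of components_ holds the coefficients of one principal component:
# the linear combination of the original variables that defines it
loadings = pd.DataFrame(
    pca.components_.T,
    index=data.feature_names,
    columns=["PC1", "PC2"],
)
print(loadings.round(2))
# Variables with large same-sign loadings on a PC move together along it;
# opposite signs indicate an inverse relationship on that component.
```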

Variance Explained and Scree Plots

  • The eigenvalues associated with each principal component represent the amount of variance explained by that component, and their sum equals the total variance in the data
  • The proportion of variance explained by each principal component can be calculated by dividing its eigenvalue by the sum of all eigenvalues, providing insight into the relative importance of each component
  • Scree plots, which display the eigenvalues in descending order, can be used to visually assess the contribution of each principal component to the total variance and help determine how many components to retain (a plotting sketch follows this list)
  • The scree plot typically shows a sharp drop in eigenvalues, followed by a leveling off, and the "elbow" point indicates the number of components that capture the majority of the variance in the data
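
The sketch below computes the proportion and cumulative proportion of variance explained and draws a scree plot with matplotlib; the dataset and plotting details are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))

eigvals = pca.explained_variance_           # eigenvalues, descending order
prop = eigvals / eigvals.sum()              # proportion of variance per PC
cum = np.cumsum(prop)                       # cumulative proportion

print("cumulative variance of first 3 PCs: %.2f" % cum[2])

# Scree plot: eigenvalues in descending order; look for the "elbow"
fig, ax = plt.subplots()
ax.plot(range(1, len(eigvals) + 1), eigvals, "o-")
ax.set_xlabel("Principal component")
ax.set_ylabel("Eigenvalue")
ax.set_title("Scree plot")
plt.show()
```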

Selecting Optimal Principal Components

Criteria for Retaining Principal Components

  • Retaining too few principal components may result in loss of important information, while retaining too many components may include noise and hinder interpretation
  • The cumulative proportion of variance explained by the selected principal components is a key criterion in determining the optimal number of components to retain
  • A common rule of thumb is to retain principal components that cumulatively explain a certain percentage (e.g., 70% or 80%) of the total variance in the data
  • The elbow method involves plotting the eigenvalues or the cumulative proportion of variance explained against the number of components and identifying the "elbow" point where the curve starts to level off, indicating the optimal number of components
  • The Kaiser-Guttman criterion suggests retaining components with eigenvalues greater than 1 when PCA is performed on the correlation matrix (standardized variables), since such components explain more variance than a single original variable; both rules are illustrated in the sketch after this list
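
A short sketch of both rules of thumb, assuming standardized data; the breast cancer dataset and the 80% variance target are chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))

eigvals = pca.explained_variance_               # one eigenvalue per component
cum = np.cumsum(eigvals) / eigvals.sum()        # cumulative variance explained

# Rule of thumb: keep enough components to explain 80% of the total variance
k_variance = int(np.searchsorted(cum, 0.80) + 1)

# Kaiser-Guttman: keep components whose eigenvalue exceeds 1
# (appropriate here because the variables were standardized)
k_kaiser = int((eigvals > 1).sum())

print("components for 80% variance:", k_variance)
print("components with eigenvalue > 1:", k_kaiser)
```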

Cross-Validation and Interpretability

  • Cross-validation techniques, such as k-fold cross-validation, can be used to assess PCA with different numbers of retained components and to select the number that performs best on a chosen evaluation metric, such as reconstruction error or classification accuracy (see the sketch after this list)
  • The interpretability and practicality of the retained principal components should also be considered, ensuring that the selected components align with the domain knowledge and research objectives
  • In some cases, retaining a smaller number of easily interpretable components may be preferred over a larger number of components that explain slightly more variance but are harder to interpret
  • The choice of the optimal number of principal components depends on the specific dataset, research goals, and the trade-off between dimensionality reduction and information retention
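
Finally, a sketch of k-fold cross-validation based on reconstruction error: for each candidate number of components, PCA is fit on the training folds and the mean squared reconstruction error is measured on the held-out fold. The dataset, fold count, and candidate values of k are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# For each candidate k, fit PCA on the training folds and measure how well
# it reconstructs the held-out fold (mean squared reconstruction error)
for k in (1, 2, 5, 10):
    errors = []
    for train_idx, test_idx in kf.split(X):
        scaler = StandardScaler().fit(X[train_idx])
        Z_train = scaler.transform(X[train_idx])
        Z_test = scaler.transform(X[test_idx])

        pca = PCA(n_components=k).fit(Z_train)
        Z_hat = pca.inverse_transform(pca.transform(Z_test))
        errors.append(np.mean((Z_test - Z_hat) ** 2))

    print("k=%2d  held-out reconstruction MSE: %.3f" % (k, np.mean(errors)))
```

Lower held-out error with more components reflects better information retention, so the final choice still involves the trade-off against interpretability described above.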