10.2 K-means Clustering and Partitioning Methods

Written by the Fiveable Content Team • Last updated September 2025

K-means clustering is a popular unsupervised learning technique that groups similar data points into clusters. It's a key method in the broader field of unsupervised learning, which aims to find patterns in data without predefined labels or categories.

This section covers the nuts and bolts of K-means, including how to assign observations to clusters and evaluate cluster quality. It also explores centroid initialization techniques and alternative partitioning methods, helping you understand the practical aspects of applying clustering algorithms to real-world data.

Cluster Assignment and Evaluation

Assigning Observations to Clusters

  • K-means clustering assigns each observation to the cluster with the nearest centroid (cluster center) based on Euclidean distance
  • Euclidean distance measures the straight-line distance between two points in a multi-dimensional space
    • Calculated as the square root of the sum of squared differences between corresponding coordinates
    • Example: Distance between points (1, 2) and (4, 6) is $\sqrt{(4-1)^2 + (6-2)^2} = \sqrt{3^2 + 4^2} = 5$
  • Observations are iteratively reassigned to clusters until convergence is reached (no further changes in cluster assignments)
  • Final cluster assignments locally minimize the within-cluster sum of squares (WCSS); K-means converges to a local optimum, not necessarily the global one (see the sketch after this list)
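
To make the loop concrete, here is a minimal NumPy sketch of the assign-and-update iteration. This is illustrative code under assumed names and toy data, not a production implementation (real implementations also guard against empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: alternate nearest-centroid assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Plain random init: pick k distinct observations as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: assignments stable
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2], [2, 3], [3, 4], [10, 10], [11, 9]])
labels, centroids = kmeans(X, k=2)
```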

Evaluating Cluster Quality

  • Within-cluster sum of squares (WCSS) measures the compactness of clusters
    • Calculated as the sum of squared distances between each observation and its assigned cluster centroid
    • Lower WCSS indicates more compact clusters (WCSS by itself does not measure separation between clusters)
    • Example: WCSS for a cluster with points (1, 2), (2, 3), and (3, 4) and centroid (2, 3) is $[(1-2)^2 + (2-3)^2] + [(2-2)^2 + (3-3)^2] + [(3-2)^2 + (4-3)^2] = 2 + 0 + 2 = 4$
  • Silhouette score assesses both the compactness and separation of clusters
    • Ranges from -1 to 1, with higher values indicating better-defined clusters
    • Compares the average distance from an observation to the other observations in its cluster (cohesion, $a$) with the average distance to observations in the nearest neighboring cluster (separation, $b$), via $s = \frac{b - a}{\max(a, b)}$
    • Example: An observation with a silhouette score of 0.8 is well matched to its assigned cluster and clearly separated from neighboring clusters
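
Both metrics are easy to compute in practice. As an illustration (assuming scikit-learn is installed), `KMeans` exposes the WCSS as its `inertia_` attribute, and `silhouette_score` returns the mean silhouette over all observations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 2], [2, 3], [3, 4], [10, 10], [11, 9], [12, 11]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.inertia_)                      # WCSS: summed squared distances to assigned centroids
print(silhouette_score(X, km.labels_))  # mean silhouette over all points, in [-1, 1]
```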

Centroid Initialization Techniques

Importance of Centroid Initialization

  • Centroid refers to the center point of a cluster, typically represented by the mean of the observations within the cluster
  • K-means clustering is sensitive to the initial placement of centroids, as it can converge to different local optima depending on the starting positions
  • Poor initialization can lead to suboptimal clustering results and longer convergence times
  • Example: Initializing centroids far from the true cluster centers may cause the algorithm to converge to a less optimal solution
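
One quick way to observe this sensitivity (a sketch assuming scikit-learn, with synthetic data from `make_blobs`): run K-means with purely random starts and a single initialization per run, and compare the final WCSS across seeds:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Purely random starts with a single run each (n_init=1) can land in
# different local optima, which shows up as different final WCSS values.
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  WCSS={km.inertia_:.2f}")
```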

K-means++ Initialization

  • K-means++ is a centroid initialization technique that aims to improve the quality and consistency of clustering results
  • Selects the first centroid randomly from the dataset
  • Subsequent centroids are chosen with probability proportional to their squared distance from the nearest existing centroid
    • Observations farther away from existing centroids have a higher probability of being selected as new centroids
  • Encourages centroids to be well-spread across the dataset, reducing the likelihood of converging to suboptimal solutions
  • Example: In a dataset with distinct clusters, K-means++ is more likely to initialize centroids within each true cluster, leading to better final clustering results compared to random initialization
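
Here is a minimal sketch of the K-means++ seeding step itself (illustrative NumPy code; the function name is made up for the example):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: spread the initial centroids across the data."""
    rng = np.random.default_rng(seed)
    # First centroid: one observation chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Next centroid sampled with probability proportional to that distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```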

Alternative Partitioning Methods

Partitioning Around Medoids (PAM)

  • PAM is the classic algorithm for K-medoids clustering: a variant of K-means that uses medoids instead of centroids as cluster representatives
  • Medoid is the observation within a cluster that minimizes the sum of distances to all other observations in the cluster
    • More robust to outliers compared to centroids, as medoids are actual observations from the dataset
  • PAM iteratively assigns observations to clusters based on their distance to the medoids and updates the medoids to minimize the total distance within each cluster
  • Example: In a dataset with outliers, PAM may produce more stable and interpretable clustering results compared to K-means, as the medoids are less affected by extreme values
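
scikit-learn itself does not ship PAM (the separate scikit-learn-extra package offers a `KMedoids` estimator); as a rough NumPy sketch of the two alternating steps PAM repeats:

```python
import numpy as np

def assign_to_medoids(X, medoids):
    """Each observation joins the cluster of its nearest medoid."""
    dists = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def update_medoid(cluster_points):
    """Pick the observation with the smallest total distance to the others."""
    dists = np.linalg.norm(
        cluster_points[:, None, :] - cluster_points[None, :, :], axis=2
    )
    return cluster_points[dists.sum(axis=1).argmin()]
```

Because the medoid must be an actual observation, a single extreme outlier shifts it far less than it shifts a mean-based centroid.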

Determining the Optimal Number of Clusters

  • Elbow method is a graphical technique for determining the appropriate number of clusters (K) in a dataset
  • Plots the within-cluster sum of squares (WCSS) against the number of clusters (K)
  • Optimal K is identified as the "elbow" point, where the rate of decrease in WCSS slows significantly with additional clusters
    • Balances the trade-off between model complexity (more clusters) and the marginal gain in cluster compactness
  • Example: In an elbow plot, if the WCSS decreases sharply from K=1 to K=3, then levels off for K>3, the optimal number of clusters would be chosen as 3
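
A typical elbow plot takes only a few lines to produce (a sketch assuming scikit-learn and matplotlib, with synthetic data from `make_blobs`):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-means for a range of K and record the WCSS (inertia_) of each fit
ks = range(1, 9)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow plot: choose K where the curve flattens")
plt.show()
```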