10.2 K-means Clustering and Partitioning Methods

Written by the Fiveable Content Team • Last updated September 2025

K-means clustering is a popular unsupervised learning technique that groups similar data points into clusters. It's a key method in the broader field of unsupervised learning, which aims to find patterns in data without predefined labels or categories.

This section covers the nuts and bolts of K-means, including how to assign observations to clusters and evaluate cluster quality. It also explores centroid initialization techniques and alternative partitioning methods, helping you understand the practical aspects of applying clustering algorithms to real-world data.

Cluster Assignment and Evaluation

Assigning Observations to Clusters

  • K-means clustering assigns each observation to the cluster with the nearest centroid (cluster center) based on Euclidean distance
  • Euclidean distance measures the straight-line distance between two points in a multi-dimensional space
    • Calculated as the square root of the sum of squared differences between corresponding coordinates
    • Example: Distance between points (1, 2) and (4, 6) is $\sqrt{(4-1)^2 + (6-2)^2} = \sqrt{3^2 + 4^2} = 5$
  • Observations are iteratively reassigned to clusters until convergence is reached (no further changes in cluster assignments)
  • Final cluster assignments locally minimize the within-cluster sum of squares (WCSS); K-means converges to a local optimum, not necessarily the global one (see the sketch after this list)
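
To make the loop concrete, here is a minimal NumPy sketch of the assign-and-update iteration. This is illustrative code under assumed names and toy data, not a production implementation (real implementations also guard against empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: alternate nearest-centroid assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Plain random init: pick k distinct observations as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: assignments stable
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2], [2, 3], [3, 4], [10, 10], [11, 9]])
labels, centroids = kmeans(X, k=2)
```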

Evaluating Cluster Quality

  • Within-cluster sum of squares (WCSS) measures the compactness of clusters
    • Calculated as the sum of squared distances between each observation and its assigned cluster centroid
    • Lower WCSS indicates more compact clusters (WCSS by itself does not measure separation between clusters)
    • Example: WCSS for a cluster with points (1, 2), (2, 3), and (3, 4) and centroid (2, 3) is $[(1-2)^2 + (2-3)^2] + [(2-2)^2 + (3-3)^2] + [(3-2)^2 + (4-3)^2] = 2 + 0 + 2 = 4$
  • Silhouette score assesses both the compactness and separation of clusters
    • Ranges from -1 to 1, with higher values indicating better-defined clusters
    • Compares the average distance from an observation to the other observations in its cluster (cohesion, $a$) with the average distance to observations in the nearest neighboring cluster (separation, $b$), via $s = \frac{b - a}{\max(a, b)}$
    • Example: An observation with a silhouette score of 0.8 is well matched to its assigned cluster and clearly separated from neighboring clusters
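
Both metrics are easy to compute in practice. As an illustration (assuming scikit-learn is installed), `KMeans` exposes the WCSS as its `inertia_` attribute, and `silhouette_score` returns the mean silhouette over all observations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 2], [2, 3], [3, 4], [10, 10], [11, 9], [12, 11]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.inertia_)                      # WCSS: summed squared distances to assigned centroids
print(silhouette_score(X, km.labels_))  # mean silhouette over all points, in [-1, 1]
```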

Centroid Initialization Techniques

Importance of Centroid Initialization

  • Centroid refers to the center point of a cluster, typically represented by the mean of the observations within the cluster
  • K-means clustering is sensitive to the initial placement of centroids, as it can converge to different local optima depending on the starting positions
  • Poor initialization can lead to suboptimal clustering results and longer convergence times
  • Example: Initializing centroids far from the true cluster centers may cause the algorithm to converge to a less optimal solution
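
One quick way to observe this sensitivity (a sketch assuming scikit-learn, with synthetic data from `make_blobs`): run K-means with purely random starts and a single initialization per run, and compare the final WCSS across seeds:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Purely random starts with a single run each (n_init=1) can land in
# different local optima, which shows up as different final WCSS values.
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  WCSS={km.inertia_:.2f}")
```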

K-means++ Initialization

  • K-means++ is a centroid initialization technique that aims to improve the quality and consistency of clustering results
  • Selects the first centroid randomly from the dataset
  • Subsequent centroids are chosen with probability proportional to their squared distance from the nearest existing centroid
    • Observations farther away from existing centroids have a higher probability of being selected as new centroids
  • Encourages centroids to be well-spread across the dataset, reducing the likelihood of converging to suboptimal solutions
  • Example: In a dataset with distinct clusters, K-means++ is more likely to initialize centroids within each true cluster, leading to better final clustering results compared to random initialization
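
Here is a minimal sketch of the K-means++ seeding step itself (illustrative NumPy code; the function name is made up for the example):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: spread the initial centroids across the data."""
    rng = np.random.default_rng(seed)
    # First centroid: one observation chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Next centroid sampled with probability proportional to that distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```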

Alternative Partitioning Methods

Partitioning Around Medoids (PAM)

  • PAM is the classic algorithm for K-medoids clustering: a variant of K-means that uses medoids instead of centroids as cluster representatives
  • Medoid is the observation within a cluster that minimizes the sum of distances to all other observations in the cluster
    • More robust to outliers compared to centroids, as medoids are actual observations from the dataset
  • PAM iteratively assigns observations to clusters based on their distance to the medoids and updates the medoids to minimize the total distance within each cluster
  • Example: In a dataset with outliers, PAM may produce more stable and interpretable clustering results compared to K-means, as the medoids are less affected by extreme values
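
scikit-learn itself does not ship PAM (the separate scikit-learn-extra package offers a `KMedoids` estimator); as a rough NumPy sketch of the two alternating steps PAM repeats:

```python
import numpy as np

def assign_to_medoids(X, medoids):
    """Each observation joins the cluster of its nearest medoid."""
    dists = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def update_medoid(cluster_points):
    """Pick the observation with the smallest total distance to the others."""
    dists = np.linalg.norm(
        cluster_points[:, None, :] - cluster_points[None, :, :], axis=2
    )
    return cluster_points[dists.sum(axis=1).argmin()]
```

Because the medoid must be an actual observation, a single extreme outlier shifts it far less than it shifts a mean-based centroid.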

Determining the Optimal Number of Clusters

  • Elbow method is a graphical technique for determining the appropriate number of clusters (K) in a dataset
  • Plots the within-cluster sum of squares (WCSS) against the number of clusters (K)
  • Optimal K is identified as the "elbow" point, where the rate of decrease in WCSS slows significantly with additional clusters
    • Balances the trade-off between model complexity (more clusters) and the marginal gain in cluster compactness
  • Example: In an elbow plot, if the WCSS decreases sharply from K=1 to K=3, then levels off for K>3, the optimal number of clusters would be chosen as 3
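
A typical elbow plot takes only a few lines to produce (a sketch assuming scikit-learn and matplotlib, with synthetic data from `make_blobs`):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-means for a range of K and record the WCSS (inertia_) of each fit
ks = range(1, 9)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow plot: choose K where the curve flattens")
plt.show()
```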