Unsupervised learning evaluation metrics help us judge how well our clustering algorithms are performing without labeled data. These metrics fall into two categories: internal validation, which uses the data itself, and external validation, which compares results to known information.
Internal metrics like silhouette score and inertia measure cluster compactness and separation. External metrics like adjusted Rand index compare clustering results to known groupings. Understanding these metrics is crucial for selecting the best clustering approach for your data.
Internal Validation Metrics
Silhouette Score and Calinski-Harabasz Index
- Silhouette Score measures how well an observation fits into its assigned cluster compared to other clusters
- Calculates the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample
- Silhouette coefficient for a sample is s = (b - a) / max(a, b)
- Ranges from -1 to 1, where a high value indicates that the object is well matched to its cluster and poorly matched to neighboring clusters
- Calinski-Harabasz Index, also known as the Variance Ratio Criterion, evaluates cluster validity as the ratio of between-cluster dispersion to within-cluster dispersion
- Defined as CH = [SS_B / (k - 1)] / [SS_W / (N - k)], where SS_B is the between-cluster sum of squares, SS_W is the within-cluster sum of squares, k is the number of clusters, and N is the total number of observations
- A higher Calinski-Harabasz score relates to a model with better defined clusters
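Both metrics above are available in scikit-learn. A minimal sketch, using synthetic blob data (the dataset and KMeans parameters below are illustrative, not from the notes):

```python
# Sketch: silhouette and Calinski-Harabasz scores for a KMeans clustering
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Synthetic, well-separated clusters (illustrative parameters)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1]; higher is better
ch = calinski_harabasz_score(X, labels)  # unbounded above; higher is better
print(f"silhouette: {sil:.3f}, Calinski-Harabasz: {ch:.1f}")
```

On well-separated blobs like these, the silhouette score should land well above 0; on overlapping clusters it drifts toward 0 or below.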
Davies-Bouldin Index and Dunn Index
- Davies-Bouldin Index measures the average similarity between each cluster and its most similar cluster
- Calculates the ratio of within-cluster distances to between-cluster distances for each cluster pair
- A lower Davies-Bouldin Index indicates better separation between the clusters and more compact clusters
- Aims to minimize the average similarity between each cluster and its most similar cluster
- Dunn Index assesses the compactness and separation of clusters
- Defined as the ratio between the minimal inter-cluster distance to the maximal intra-cluster distance
- A higher Dunn Index implies better clustering, as it indicates that the clusters are compact and well-separated
- Sensitive to outliers as it only considers the maximum intra-cluster distance and minimum inter-cluster distance
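scikit-learn provides the Davies-Bouldin index directly, but has no built-in Dunn index, so the sketch below computes it by hand from its definition (minimum inter-cluster distance over maximum intra-cluster diameter). The data and `dunn_index` helper are illustrative:

```python
# Sketch: Davies-Bouldin via scikit-learn, Dunn index computed manually
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

def dunn_index(X, labels):
    """Min inter-cluster distance divided by max intra-cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest diameter over all clusters (maximal intra-cluster distance)
    max_diam = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points in different clusters
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

db = davies_bouldin_score(X, labels)  # lower is better
dunn = dunn_index(X, labels)          # higher is better
print(f"Davies-Bouldin: {db:.3f}, Dunn: {dunn:.3f}")
```

Note how the max/min structure of the Dunn index makes it sensitive to a single outlier, as the bullet above points out: one stray point can inflate the maximal diameter and drag the whole score down.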
Inertia
- Inertia, or within-cluster sum-of-squares (WSS), measures the compactness of the clustering
- Calculated as the sum of squared distances of samples to their closest cluster center
- A lower inertia indicates more compact clusters
- Often used in combination with other metrics (e.g., the silhouette score) to determine the optimal number of clusters
- Inertia decreases monotonically as the number of clusters increases, so it alone cannot determine the optimal number of clusters
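The last bullet is the basis of the elbow heuristic: plot inertia against k and look for the point where the decrease flattens. A minimal sketch using KMeans's `inertia_` attribute (synthetic data, illustrative range of k):

```python
# Sketch: inertia (within-cluster sum of squares) across candidate k values
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to nearest center

# Inertia keeps shrinking as k grows, so look for the "elbow" where the
# drop flattens (here, around the true number of clusters) rather than
# simply picking the k with the smallest inertia.
for k, wss in zip(range(1, 8), inertias):
    print(k, round(wss, 1))
```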
External Validation Metrics
Adjusted Rand Index and Mutual Information Score
- Adjusted Rand Index (ARI) measures the similarity between two clusterings, adjusting for chance groupings
- Calculates the number of pairs of elements that are either in the same group or in different groups in both clusterings
- Ranges from -1 to 1, where 1 indicates perfect agreement between the clusterings, 0 represents the expected score of random labelings, and negative values indicate less agreement than expected by chance
- Adjusts the Rand Index to account for the expected similarity of random clusterings
- Mutual Information Score quantifies the amount of information shared between two clusterings
- Measures how much knowing one clustering reduces the uncertainty about the other
- Ranges from 0 to min(H(U), H(V)), where U and V are the two clusterings and H(.) is the entropy
- A higher Mutual Information Score suggests a higher agreement between the clusterings
- Can be normalized to adjust for the number of clusters and samples
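Both external metrics are one call each in scikit-learn; the normalized variant mentioned in the last bullet is `normalized_mutual_info_score`. The label vectors below are illustrative (same grouping under different cluster IDs, with one deliberately misassigned point):

```python
# Sketch: comparing a predicted clustering against reference labels
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]  # relabeled clusters + one error

ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
print(f"ARI: {ari:.3f}, NMI: {nmi:.3f}")

# Both metrics ignore the cluster IDs themselves, only the groupings:
assert adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```

Because cluster IDs are arbitrary, these metrics are permutation-invariant: relabeling cluster 0 as cluster 1 everywhere leaves the scores unchanged.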
Cophenetic Correlation Coefficient
- Cophenetic Correlation Coefficient measures how faithfully a hierarchical clustering preserves the pairwise distances between the original data points
- Compares the distances between samples in the original space to the distances between samples in the hierarchical clustering
- Calculated as the Pearson correlation between the original distances and the cophenetic distances
- Ranges from -1 to 1, where a value closer to 1 indicates that the hierarchical clustering accurately preserves the original distances
- Helps to assess the quality of a hierarchical clustering and to compare different linkage methods (single, complete, average)
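SciPy's `cophenet` computes this coefficient directly from a linkage matrix and the original condensed distances, which makes the linkage comparison in the last bullet a short loop. A sketch on synthetic data (blob parameters are illustrative):

```python
# Sketch: cophenetic correlation for three linkage methods
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=7)
dists = pdist(X)  # condensed pairwise distances in the original space

corrs = {}
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)
    # cophenet returns (Pearson correlation, cophenetic distances)
    c, _ = cophenet(Z, dists)
    corrs[method] = c
    print(f"{method}: {c:.3f}")
```

The linkage method with the coefficient closest to 1 preserves the original pairwise distances most faithfully; average linkage often scores well on data like this, though the ranking depends on the dataset.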