Data Visualization Unit 11 Review

11.2 Hierarchical and k-means clustering visualization


Written by the Fiveable Content Team • Last updated September 2025

Hierarchical and k-means clustering are powerful tools for finding patterns in data. They group similar items together, revealing hidden structures and relationships. Each method has its strengths: hierarchical builds a tree-like structure, while k-means partitions data into a set number of clusters.

Visualizing clustering results helps us understand the data better. Dendrograms show the hierarchical structure, while heatmaps display patterns across variables. These visuals, combined with evaluation techniques, guide us in interpreting and validating our clustering results.

Hierarchical vs K-means Clustering

Clustering Methods Comparison

  • Hierarchical clustering builds a hierarchy of clusters by merging smaller clusters into larger ones (agglomerative approach) or dividing larger clusters into smaller ones (divisive approach)
    • Does not require specifying the number of clusters upfront
    • Produces a tree-like structure called a dendrogram, which shows the hierarchical relationships between clusters
    • Can handle different distance metrics (Euclidean, Manhattan, etc.) and linkage methods (single, complete, average)
    • Has a time complexity of O(n^3) for the agglomerative approach and O(2^n) for the divisive approach, where n is the number of data points
  • K-means clustering divides the data into a pre-specified number of clusters (k) by minimizing the sum of squared distances between data points and their assigned cluster centroids
    • Requires specifying the number of clusters (k) in advance
    • Assigns each data point to a specific cluster without capturing the hierarchical structure
    • Typically uses Euclidean distance and aims to minimize the within-cluster sum of squares
    • Has a time complexity of O(n * k * i), where n is the number of data points, k is the number of clusters, and i is the number of iterations
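
Both approaches in the comparison above can be tried in a few lines of Python. Below is a minimal sketch using scipy for agglomerative hierarchical clustering and scikit-learn for k-means; the synthetic blob data and parameter choices are illustrative assumptions, not recommendations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small synthetic dataset: 150 points around 3 centers (illustrative choice)
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Agglomerative hierarchical clustering: build the full merge tree first,
# then cut it into 3 flat clusters afterwards
Z = linkage(X, method="average", metric="euclidean")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# K-means: the number of clusters k must be chosen up front
km = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = km.fit_predict(X)

print("Hierarchical cluster sizes:", np.bincount(hier_labels)[1:])
print("K-means cluster sizes:     ", np.bincount(kmeans_labels))
```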

Clustering Algorithm Selection

  • Choose hierarchical clustering when
    • The hierarchical structure and relationships between clusters are of interest
    • The number of clusters is not known in advance
    • Different distance metrics or linkage methods need to be explored
    • The dataset is relatively small (due to higher computational complexity)
  • Choose k-means clustering when
    • The goal is to partition the data into a fixed number of clusters
    • The number of clusters (k) can be determined or estimated beforehand
    • Euclidean distance is a suitable measure for the data
    • The dataset is large (due to lower computational complexity compared to hierarchical clustering)
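
When k must be chosen before running k-means, a common heuristic is to look for an "elbow" in the within-cluster sum of squares (inertia) as k increases. A rough sketch follows, with the candidate range of k and the synthetic data chosen arbitrarily.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Within-cluster sum of squares (inertia) for a range of candidate k values;
# the "elbow" where the curve flattens suggests a reasonable choice of k
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}")
```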

Visualizing Clustering Results

Dendrograms

  • A dendrogram is a tree-like diagram that represents the hierarchical structure of clusters obtained from hierarchical clustering
    • Shows the order and distance at which clusters are merged or split
    • The vertical axis represents the distance or dissimilarity between clusters
    • The horizontal axis represents the data points or clusters
    • The height of each branch represents the distance between the clusters being merged or split
  • Interpreting dendrograms
    • Clusters are formed by drawing a horizontal line at a chosen height and considering the vertical lines that intersect it
    • Lower heights indicate more similar clusters, while higher heights indicate more dissimilar clusters
    • The order of data points or clusters along the horizontal axis can reveal patterns or groupings
  • Cutting the dendrogram at different heights produces different numbers and configurations of clusters
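
A sketch of drawing a dendrogram with scipy and cutting it at a chosen height to obtain flat clusters; the cut height of 5.0 is an arbitrary illustration, not a recommended value.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=40, centers=3, random_state=1)
Z = linkage(X, method="complete")

# Draw the dendrogram; the vertical axis is the merge distance between clusters
dendrogram(Z)
plt.axhline(y=5.0, linestyle="--")   # horizontal cut line (illustrative height)
plt.ylabel("Distance")
plt.show()

# Cutting the tree at that height yields the flat cluster assignments
labels = fcluster(Z, t=5.0, criterion="distance")
print("Number of clusters at height 5.0:", len(set(labels)))
```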

Cluster Heatmaps

  • Cluster heatmaps visualize the clustering results along with the original data matrix
    • Rows and columns of the heatmap are reordered based on the clustering results
    • Each cell represents the value of one variable for one observation, encoded by color intensity
    • Rows and columns are typically clustered using hierarchical clustering
    • Dendrograms are displayed alongside the heatmap to show the clustering structure
  • Interpreting cluster heatmaps
    • Blocks of similar colors indicate clusters or groups of data points with similar values
    • The dendrograms alongside the heatmap help identify the hierarchical relationships between clusters
    • Additional annotations, such as color bars or row/column labels, provide more information about the data and clustering results
  • Cluster heatmaps are useful for visualizing patterns, trends, and groups within high-dimensional datasets
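
One convenient way to build such a figure is seaborn's clustermap, which clusters rows and columns hierarchically, draws marginal dendrograms, and colors cells by value. A minimal sketch on random data; the scaling and colormap choices are illustrative.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Fake expression-style matrix: 20 samples (rows) x 10 variables (columns)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 10)),
                    columns=[f"var{i}" for i in range(10)])

# Rows and columns are each clustered hierarchically; dendrograms are drawn
# on the margins and the cells are colored by value
sns.clustermap(data, method="average", metric="euclidean",
               cmap="vlag", standard_scale=1)
plt.show()
```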

Evaluating Clustering Quality

Internal Validation Measures

  • Silhouette analysis assesses the quality of clustering results by measuring how well each data point fits into its assigned cluster compared to other clusters
    • Silhouette coefficient ranges from -1 to 1, with higher values indicating better clustering quality
    • Calculates the average silhouette width for each cluster and the overall dataset
    • A high average silhouette width indicates good separation between clusters and cohesion within clusters
  • Other internal validation measures include
    • Davies-Bouldin index: Measures the ratio of within-cluster distances to between-cluster distances, with lower values indicating better clustering
    • Calinski-Harabasz index: Compares the between-cluster dispersion to the within-cluster dispersion, with higher values indicating better clustering
    • Dunn index: Measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance, with higher values indicating better clustering
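
scikit-learn provides the silhouette, Davies-Bouldin, and Calinski-Harabasz scores directly (the Dunn index has no built-in implementation there). A short sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette (higher is better):       ", silhouette_score(X, labels))
print("Davies-Bouldin (lower is better):    ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz (higher is better):", calinski_harabasz_score(X, labels))
```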

Stability Evaluation

  • Cluster stability evaluates the consistency of clustering results across different runs or subsets of the data
    • Apply the clustering algorithm multiple times with different initializations or subsets of the data
    • Compare the resulting cluster assignments using measures such as the adjusted Rand index or normalized mutual information
    • Stable clusters should have high agreement across different runs
  • Bootstrapping or resampling techniques can be used to assess the stability of clustering results
    • Generate multiple bootstrap samples from the original dataset
    • Apply the clustering algorithm to each bootstrap sample
    • Evaluate the consistency of cluster assignments across the bootstrap samples
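
A rough sketch of one possible stability check: refit k-means on bootstrap resamples, relabel the full dataset with each fitted model, and compare each partition to a reference run using the adjusted Rand index. The number of resamples and the synthetic data are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
rng = np.random.default_rng(0)

# Reference clustering on the full dataset
ref_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Refit on bootstrap resamples, relabel the full dataset with each fitted model,
# and measure agreement with the reference partition
scores = []
for b in range(20):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample indices
    km = KMeans(n_clusters=3, n_init=10, random_state=b).fit(X[idx])
    scores.append(adjusted_rand_score(ref_labels, km.predict(X)))

print("Mean ARI across bootstrap refits:", np.mean(scores))
```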

External Validation Measures

  • External validation measures compare the clustering results with known class labels or ground truth, if available
    • Adjusted Rand index: Measures the agreement between two partitions, accounting for chance, with values ranging from -1 to 1 (higher is better)
    • Normalized mutual information: Measures the mutual information between two partitions, normalized to account for cluster sizes, with values ranging from 0 to 1 (higher is better)
    • Purity: Measures the proportion of data points in each cluster that belong to the most common class, with values ranging from 0 to 1 (higher is better)
  • External validation measures are useful when the true class labels are known and the goal is to assess the accuracy of the clustering results
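
When ground-truth labels are available, the adjusted Rand index and normalized mutual information come straight from scikit-learn, and purity can be computed from a contingency table in a couple of lines. A sketch using the Iris class labels as the ground truth:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score)
from sklearn.metrics.cluster import contingency_matrix

X, y_true = load_iris(return_X_y=True)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Adjusted Rand index:", adjusted_rand_score(y_true, y_pred))
print("Normalized mutual information:", normalized_mutual_info_score(y_true, y_pred))

# Purity: for each cluster take the count of its most common true class,
# sum over clusters, and divide by the total number of points
cm = contingency_matrix(y_true, y_pred)
purity = cm.max(axis=0).sum() / cm.sum()
print("Purity:", purity)
```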

Clustering for Pattern Discovery

Data Preprocessing

  • Clustering can be applied to various types of data, including numerical, categorical, and mixed data types
    • Numerical data: Continuous or discrete values (e.g., age, income)
    • Categorical data: Nominal or ordinal variables (e.g., gender, education level)
    • Mixed data: Combination of numerical and categorical variables
  • Data preprocessing steps may be necessary to ensure compatibility with the clustering algorithms
    • Scaling: Normalize or standardize numerical variables to have similar ranges or distributions
    • Encoding: Convert categorical variables into numerical representations (e.g., one-hot encoding, label encoding)
    • Handling missing values: Remove or impute missing data points
    • Dimensionality reduction: Reduce the number of features using techniques like PCA or feature selection
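
A minimal preprocessing sketch for mixed data using a scikit-learn ColumnTransformer; the column names and the small hypothetical table are made up purely for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical mixed-type customer table with a couple of missing values
df = pd.DataFrame({
    "age":       [25, 40, None, 33],
    "income":    [40000, 85000, 52000, None],
    "education": ["BSc", "PhD", "BSc", "MSc"],
})

numeric = ["age", "income"]
categorical = ["education"]

# Impute and scale numeric columns; one-hot encode categorical columns
prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = prep.fit_transform(df)   # ready to feed into a clustering algorithm
print(X.shape)
```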

Clustering Applications

  • Hierarchical clustering for identifying nested structures and relationships
    • Phylogenetic analysis: Cluster organisms based on genetic or morphological similarities to infer evolutionary relationships
    • Customer segmentation: Identify hierarchical groups of customers based on their purchasing behavior or demographics
    • Document clustering: Organize documents into a hierarchical structure based on their content or topics
  • K-means clustering for partitioning data into distinct groups
    • Market segmentation: Divide customers into segments based on their preferences, needs, or behaviors
    • Image compression: Group similar pixels or regions in an image to reduce storage space
    • Anomaly detection: Identify data points that do not belong to any cluster as potential anomalies or outliers
  • Combining clustering with dimensionality reduction techniques
    • Visualize high-dimensional data in lower-dimensional spaces (2D or 3D) while preserving the clustering structure
    • Techniques like PCA or t-SNE can be used to project the data onto a lower-dimensional space
    • The resulting visualization can help identify clusters, patterns, or separations in the data (see the sketch after this list)
  • Using clustering results for further analysis or modeling
    • Data exploration: Gain insights into the underlying structure and relationships within the data
    • Feature engineering: Use cluster assignments or distances as input features for classification or regression models
    • Recommender systems: Group similar users or items based on their preferences or behaviors to generate recommendations
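
As a closing sketch of the dimensionality-reduction idea mentioned above: cluster in the full feature space, then project onto two principal components purely for plotting. The dataset and styling choices are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Cluster using all features, then reduce to 2D only for visualization
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis", s=20)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("K-means clusters projected onto the first two principal components")
plt.show()
```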