Fiveable

📉 Statistical Methods for Data Science Unit 12 Review

12.2 Hierarchical Clustering

Written by the Fiveable Content Team • Last updated September 2025

Hierarchical clustering organizes data into a tree-like structure, revealing relationships between points. It's a versatile method that can uncover hidden patterns and group similar items together, making it useful for various fields like biology and marketing.

This approach offers two main types: agglomerative (bottom-up) and divisive (top-down). By using different linkage methods, it can adapt to various data structures and reveal insights about the underlying relationships in your dataset.

Types of Hierarchical Clustering

Agglomerative and Divisive Clustering

  • Agglomerative clustering starts with each data point as a separate cluster and iteratively merges the closest clusters until all points belong to a single cluster
    • Also known as a "bottom-up" approach since it begins with individual data points and builds up to a single cluster
    • At each step, the two closest clusters are combined into a new cluster
    • The process continues until a desired number of clusters is reached or all data points are in one cluster
  • Divisive clustering begins with all data points in a single cluster and recursively splits the clusters until each data point is in its own cluster
    • Follows a "top-down" approach, starting with a single cluster containing all data and dividing it into smaller clusters
    • At each step, the largest cluster is split into two smaller clusters based on a chosen criterion
    • The splitting process continues until each data point is in its own cluster or a desired number of clusters is achieved
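The agglomerative (bottom-up) procedure described above can be sketched with SciPy's `scipy.cluster.hierarchy` module — an assumed choice of library, applied here to a made-up two-group toy dataset:

```python
# Minimal sketch of agglomerative clustering with SciPy (assumed library)
# on a made-up 2-D toy dataset with two visually obvious groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # group near the origin
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])  # group near (5, 5)

# linkage() performs the iterative bottom-up merging; each row of Z
# records one merge: (cluster i, cluster j, merge distance, new size)
Z = linkage(X, method="average")

# Stop the hierarchy at a desired number of clusters (here, 2)
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the two tight groups receive two distinct labels
```

SciPy implements only the agglomerative direction; divisive clustering is less commonly available in standard libraries.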

Dendrograms

  • A dendrogram is a tree-like diagram used to visualize the arrangement of clusters produced by hierarchical clustering
    • The x-axis represents the data points, while the y-axis represents the distance or dissimilarity between clusters
    • Each merge or split is represented by a horizontal line connecting the clusters
    • The height of the horizontal line indicates the distance between the merged or split clusters
  • Dendrograms allow for easy interpretation of the clustering results
    • The closer two data points or clusters are connected in the dendrogram, the more similar they are
    • Cutting the dendrogram at a specific height (distance threshold) determines the final number of clusters
    • Example: In a dendrogram of animal species, closely related species (cats and tigers) will be connected at a lower height compared to more distantly related species (cats and elephants)
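Cutting the tree at a distance threshold, as described above, can be sketched as follows (SciPy is an assumed library choice; the 1-D data are made up for illustration):

```python
# Sketch: cutting a dendrogram at a height (distance threshold) with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1-D data: two tight pairs and one distant outlier
X = np.array([[0.0], [0.3], [5.0], [5.4], [20.0]])

Z = linkage(X, method="single")

# Cutting at height t=2.0 keeps together only points that merge
# below that distance: the two pairs survive, the outlier stands alone
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # three clusters result from this cut
```

Raising the threshold merges clusters into fewer, coarser groups; lowering it produces more, finer ones.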

Linkage Methods

Distance-based Linkage Methods

  • Single linkage determines the distance between two clusters as the minimum distance between any two points in the clusters
    • Also known as the nearest neighbor method
    • Tends to create long, chain-like clusters and can be sensitive to noise and outliers
    • Example: In a dataset of cities, single linkage would consider the distance between two clusters of cities as the shortest distance between any pair of cities from each cluster
  • Complete linkage calculates the distance between two clusters as the maximum distance between any two points in the clusters
    • Also referred to as the farthest neighbor method
    • Tends to create compact, tightly-bound clusters and is less susceptible to noise and outliers compared to single linkage
    • Example: In a dataset of animal species, complete linkage would consider the distance between two clusters of species as the largest distance between any pair of species from each cluster
  • Average linkage computes the distance between two clusters as the average distance between all pairs of points in the clusters
    • Strikes a balance between single and complete linkage methods
    • Less affected by noise and outliers compared to single linkage, but may not create clusters as compact as complete linkage
    • Example: In a dataset of customer preferences, average linkage would calculate the distance between two clusters of customers by taking the mean of all pairwise distances between customers from each cluster
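The three inter-cluster distance definitions above (minimum, maximum, and mean of all pairwise distances) can be computed directly on two small made-up clusters:

```python
# Sketch: single, complete, and average inter-cluster distances computed
# by hand on two made-up clusters of 2-D points.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A
B = np.array([[3.0, 0.0], [5.0, 0.0]])   # cluster B

D = cdist(A, B)        # matrix of all pairwise distances between A and B

single = D.min()       # nearest pair of points across the clusters
complete = D.max()     # farthest pair of points across the clusters
average = D.mean()     # mean over all pairwise distances

print(single, complete, average)  # 2.0, 5.0, 3.5
```

In practice these same rules are selected via the `method` argument of SciPy's `linkage` ("single", "complete", "average").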

Variance-based Linkage Methods

  • Ward's method aims to minimize the total within-cluster variance when merging clusters
    • At each step, the merge that results in the smallest increase in total within-cluster variance is chosen
    • Tends to create clusters of similar sizes and shapes
    • Example: In a dataset of stock prices, Ward's method would merge clusters of stocks in a way that minimizes the overall variance within each cluster
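Ward's method is likewise available through SciPy's `linkage` (an assumed library; the two Gaussian blobs below are made-up data):

```python
# Sketch: Ward linkage on two made-up Gaussian blobs.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),   # blob centered at (0, 0)
               rng.normal(5.0, 0.3, (20, 2))])  # blob centered at (5, 5)

# "ward" chooses each merge to minimize the increase
# in total within-cluster variance
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels))  # two clusters of similar size
```

Note that Ward's method assumes Euclidean distances between observations.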

Evaluating Linkage Methods

  • Cophenetic correlation measures the correlation between the original pairwise distances and the distances obtained from the dendrogram
    • A high cophenetic correlation (close to 1) indicates that the dendrogram accurately represents the original distances between data points
    • Helps in assessing the quality of the clustering results and comparing different linkage methods
    • Example: A cophenetic correlation of 0.9 indicates strong agreement between the dendrogram's cophenetic distances and the original pairwise distances, suggesting the tree is a faithful summary of the data (note that 0.9 is a correlation coefficient, not a percentage of distances preserved)
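Computing the cophenetic correlation can be sketched with SciPy's `cophenet` (an assumed library choice; the toy points below are made up):

```python
# Sketch: cophenetic correlation between original pairwise distances
# and dendrogram (cophenetic) distances, using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Made-up data: two tight pairs plus one distant point
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [4.0, 4.0], [4.1, 4.2],
              [9.0, 0.0]])

d = pdist(X)                      # original condensed pairwise distances
Z = linkage(X, method="average")
c, coph_dists = cophenet(Z, d)    # c: cophenetic correlation coefficient

print(round(c, 3))  # close to 1 for well-separated data like this
```

Repeating this for several `method` values (e.g. "single", "complete", "ward") and comparing the resulting coefficients is one way to choose a linkage method for a given dataset.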