Fiveable

📉 Statistical Methods for Data Science Unit 12 Review

12.2 Hierarchical Clustering

Written by the Fiveable Content Team • Last updated September 2025

Hierarchical clustering organizes data into a tree-like structure, revealing relationships between points. It's a versatile method that can uncover hidden patterns and group similar items together, making it useful for various fields like biology and marketing.

This approach offers two main types: agglomerative (bottom-up) and divisive (top-down). By using different linkage methods, it can adapt to various data structures and reveal insights about the underlying relationships in your dataset.

Types of Hierarchical Clustering

Agglomerative and Divisive Clustering

  • Agglomerative clustering starts with each data point as a separate cluster and iteratively merges the closest clusters until all points belong to a single cluster
    • Also known as a "bottom-up" approach since it begins with individual data points and builds up to a single cluster
    • At each step, the two closest clusters are combined into a new cluster
    • The process continues until a desired number of clusters is reached or all data points are in one cluster
  • Divisive clustering begins with all data points in a single cluster and recursively splits the clusters until each data point is in its own cluster
    • Follows a "top-down" approach, starting with a single cluster containing all data and dividing it into smaller clusters
    • At each step, the largest cluster is split into two smaller clusters based on a chosen criterion
    • The splitting process continues until each data point is in its own cluster or a desired number of clusters is achieved
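The agglomerative (bottom-up) procedure described above can be sketched with SciPy's `scipy.cluster.hierarchy` module — an assumed choice of library, applied here to a made-up two-group toy dataset:

```python
# Minimal sketch of agglomerative clustering with SciPy (assumed library)
# on a made-up 2-D toy dataset with two visually obvious groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # group near the origin
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])  # group near (5, 5)

# linkage() performs the iterative bottom-up merging; each row of Z
# records one merge: (cluster i, cluster j, merge distance, new size)
Z = linkage(X, method="average")

# Stop the hierarchy at a desired number of clusters (here, 2)
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the two tight groups receive two distinct labels
```

SciPy implements only the agglomerative direction; divisive clustering is less commonly available in standard libraries.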

Dendrograms

  • A dendrogram is a tree-like diagram used to visualize the arrangement of clusters produced by hierarchical clustering
    • The x-axis represents the data points, while the y-axis represents the distance or dissimilarity between clusters
    • Each merge or split is represented by a horizontal line connecting the clusters
    • The height of the horizontal line indicates the distance between the merged or split clusters
  • Dendrograms allow for easy interpretation of the clustering results
    • The closer two data points or clusters are connected in the dendrogram, the more similar they are
    • Cutting the dendrogram at a specific height (distance threshold) determines the final number of clusters
    • Example: In a dendrogram of animal species, closely related species (cats and tigers) will be connected at a lower height compared to more distantly related species (cats and elephants)
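Cutting the tree at a distance threshold, as described above, can be sketched as follows (SciPy is an assumed library choice; the 1-D data are made up for illustration):

```python
# Sketch: cutting a dendrogram at a height (distance threshold) with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1-D data: two tight pairs and one distant outlier
X = np.array([[0.0], [0.3], [5.0], [5.4], [20.0]])

Z = linkage(X, method="single")

# Cutting at height t=2.0 keeps together only points that merge
# below that distance: the two pairs survive, the outlier stands alone
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # three clusters result from this cut
```

Raising the threshold merges clusters into fewer, coarser groups; lowering it produces more, finer ones.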

Linkage Methods

Distance-based Linkage Methods

  • Single linkage determines the distance between two clusters as the minimum distance between any two points in the clusters
    • Also known as the nearest neighbor method
    • Tends to create long, chain-like clusters and can be sensitive to noise and outliers
    • Example: In a dataset of cities, single linkage would consider the distance between two clusters of cities as the shortest distance between any pair of cities from each cluster
  • Complete linkage calculates the distance between two clusters as the maximum distance between any two points in the clusters
    • Also referred to as the farthest neighbor method
    • Tends to create compact, tightly-bound clusters and is less susceptible to noise and outliers compared to single linkage
    • Example: In a dataset of animal species, complete linkage would consider the distance between two clusters of species as the largest distance between any pair of species from each cluster
  • Average linkage computes the distance between two clusters as the average distance between all pairs of points in the clusters
    • Strikes a balance between single and complete linkage methods
    • Less affected by noise and outliers compared to single linkage, but may not create clusters as compact as complete linkage
    • Example: In a dataset of customer preferences, average linkage would calculate the distance between two clusters of customers by taking the mean of all pairwise distances between customers from each cluster
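The three inter-cluster distance definitions above (minimum, maximum, and mean of all pairwise distances) can be computed directly on two small made-up clusters:

```python
# Sketch: single, complete, and average inter-cluster distances computed
# by hand on two made-up clusters of 2-D points.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A
B = np.array([[3.0, 0.0], [5.0, 0.0]])   # cluster B

D = cdist(A, B)        # matrix of all pairwise distances between A and B

single = D.min()       # nearest pair of points across the clusters
complete = D.max()     # farthest pair of points across the clusters
average = D.mean()     # mean over all pairwise distances

print(single, complete, average)  # 2.0, 5.0, 3.5
```

In practice these same rules are selected via the `method` argument of SciPy's `linkage` ("single", "complete", "average").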

Variance-based Linkage Methods

  • Ward's method aims to minimize the total within-cluster variance when merging clusters
    • At each step, the merge that results in the smallest increase in total within-cluster variance is chosen
    • Tends to create clusters of similar sizes and shapes
    • Example: In a dataset of stock prices, Ward's method would merge clusters of stocks in a way that minimizes the overall variance within each cluster
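Ward's method is likewise available through SciPy's `linkage` (an assumed library; the two Gaussian blobs below are made-up data):

```python
# Sketch: Ward linkage on two made-up Gaussian blobs.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),   # blob centered at (0, 0)
               rng.normal(5.0, 0.3, (20, 2))])  # blob centered at (5, 5)

# "ward" chooses each merge to minimize the increase
# in total within-cluster variance
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels))  # two clusters of similar size
```

Note that Ward's method assumes Euclidean distances between observations.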

Evaluating Linkage Methods

  • Cophenetic correlation measures the correlation between the original pairwise distances and the distances obtained from the dendrogram
    • A high cophenetic correlation (close to 1) indicates that the dendrogram accurately represents the original distances between data points
    • Helps in assessing the quality of the clustering results and comparing different linkage methods
    • Example: A cophenetic correlation of 0.9 indicates strong agreement between the dendrogram's cophenetic distances and the original pairwise distances, suggesting the tree is a faithful summary of the data (note that 0.9 is a correlation coefficient, not a percentage of distances preserved)
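Computing the cophenetic correlation can be sketched with SciPy's `cophenet` (an assumed library choice; the toy points below are made up):

```python
# Sketch: cophenetic correlation between original pairwise distances
# and dendrogram (cophenetic) distances, using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Made-up data: two tight pairs plus one distant point
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [4.0, 4.0], [4.1, 4.2],
              [9.0, 0.0]])

d = pdist(X)                      # original condensed pairwise distances
Z = linkage(X, method="average")
c, coph_dists = cophenet(Z, d)    # c: cophenetic correlation coefficient

print(round(c, 3))  # close to 1 for well-separated data like this
```

Repeating this for several `method` values (e.g. "single", "complete", "ward") and comparing the resulting coefficients is one way to choose a linkage method for a given dataset.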