Fiveable

๐Ÿ“‰Statistical Methods for Data Science Unit 12 Review

QR code for Statistical Methods for Data Science practice questions

12.3 Density-based Clustering

๐Ÿ“‰Statistical Methods for Data Science
Unit 12 Review

12.3 Density-based Clustering

Written by the Fiveable Content Team โ€ข Last updated September 2025
Written by the Fiveable Content Team โ€ข Last updated September 2025
๐Ÿ“‰Statistical Methods for Data Science
Unit & Topic Study Guides

Density-based clustering finds groups of closely packed points, separating them from low-density areas. DBSCAN, a popular algorithm, doesn't need you to specify the number of clusters beforehand and can find clusters of any shape.

DBSCAN uses two key parameters: Eps (maximum neighbor distance) and MinPts (minimum points for a dense region). It identifies core, border, and noise points, forming clusters based on density-reachability and connectivity between points.

DBSCAN Algorithm

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

  • DBSCAN is a density-based clustering algorithm that groups together points that are closely packed together and marks points as outliers that lie in low-density regions
  • Clusters are defined as areas of high density separated by areas of low density
  • Does not require specifying the number of clusters in advance like k-means clustering
  • Can discover clusters of arbitrary shape and is not limited to discovering spherical clusters

Key Parameters and Concepts

  • Eps (epsilon) is the maximum radius of the neighborhood around a point $x$
    • Determines the maximum distance between two points for them to be considered neighbors
  • MinPts is the minimum number of points required to form a dense region
    • Determines the minimum number of points required in the Eps neighborhood of a point for it to be considered a core point
  • Core point is a point that has at least MinPts points within its Eps neighborhood, including itself
  • Border point is a point that has fewer than MinPts points within its Eps neighborhood, but is reachable from some core point
  • Noise point is any point that is neither a core point nor a border point
    • Noise points do not belong to any cluster and are treated as outliers

Density Concepts

Reachability and Connectivity

  • Directly density-reachable: A point $p$ is directly density-reachable from a point $q$ if $p$ is within the Eps neighborhood of $q$ and $q$ is a core point
  • Density-reachable: A point $p$ is density-reachable from a point $q$ if there is a chain of points $p_1, \dots, p_n$, where $p_1 = q$, $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$ for all $i$
    • Density-reachability is not symmetric (e.g., a border point may be density-reachable from a core point, but not vice versa)
  • Density-connected: Two points $p$ and $q$ are density-connected if there is a point $o$ such that both $p$ and $q$ are density-reachable from $o$
    • Density-connectivity is symmetric (e.g., if $p$ is density-connected to $q$, then $q$ is also density-connected to $p$)

Cluster Formation

  • A cluster in DBSCAN is defined as a set of density-connected points that is maximal with respect to density-reachability
  • Points within a cluster are mutually density-connected and density-reachable from at least one core point in the cluster
  • Different clusters are separated by regions of low density, where points are not density-reachable from any core point in the clusters

Extensions

Ordering Points to Identify the Clustering Structure (OPTICS)

  • OPTICS is an extension of DBSCAN that creates a reachability plot, which is a visual representation of the clustering structure
  • Reachability plot displays the reachability distance for each point, which represents the minimum distance for a point to be density-reachable from a higher density point
  • Clusters appear as valleys or dents in the reachability plot, with deeper valleys indicating denser clusters
  • Provides a hierarchical clustering structure without explicitly producing a data set clustering
  • Allows for extracting clusters at different density levels (e.g., by varying the Eps parameter) from the reachability plot