Density-based clustering finds groups of closely packed points, separating them from low-density areas. DBSCAN, a popular algorithm, doesn't need you to specify the number of clusters beforehand and can find clusters of any shape.
DBSCAN uses two key parameters: Eps (maximum neighbor distance) and MinPts (minimum points for a dense region). It identifies core, border, and noise points, forming clusters based on density-reachability and connectivity between points.
DBSCAN Algorithm
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
- DBSCAN is a density-based clustering algorithm that groups together points that are closely packed together and marks points as outliers that lie in low-density regions
- Clusters are defined as areas of high density separated by areas of low density
- Does not require specifying the number of clusters in advance like k-means clustering
- Can discover clusters of arbitrary shape and is not limited to discovering spherical clusters
Key Parameters and Concepts
- Eps (epsilon) is the maximum radius of the neighborhood around a point $x$
- Determines the maximum distance between two points for them to be considered neighbors
- MinPts is the minimum number of points required to form a dense region
- Determines the minimum number of points required in the Eps neighborhood of a point for it to be considered a core point
- Core point is a point that has at least MinPts points within its Eps neighborhood, including itself
- Border point is a point that has fewer than MinPts points within its Eps neighborhood, but is reachable from some core point
- Noise point is any point that is neither a core point nor a border point
- Noise points do not belong to any cluster and are treated as outliers
Density Concepts
Reachability and Connectivity
- Directly density-reachable: A point $p$ is directly density-reachable from a point $q$ if $p$ is within the Eps neighborhood of $q$ and $q$ is a core point
- Density-reachable: A point $p$ is density-reachable from a point $q$ if there is a chain of points $p_1, \dots, p_n$, where $p_1 = q$, $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$ for all $i$
- Density-reachability is not symmetric (e.g., a border point may be density-reachable from a core point, but not vice versa)
- Density-connected: Two points $p$ and $q$ are density-connected if there is a point $o$ such that both $p$ and $q$ are density-reachable from $o$
- Density-connectivity is symmetric (e.g., if $p$ is density-connected to $q$, then $q$ is also density-connected to $p$)
Cluster Formation
- A cluster in DBSCAN is defined as a set of density-connected points that is maximal with respect to density-reachability
- Points within a cluster are mutually density-connected and density-reachable from at least one core point in the cluster
- Different clusters are separated by regions of low density, where points are not density-reachable from any core point in the clusters
Extensions
Ordering Points to Identify the Clustering Structure (OPTICS)
- OPTICS is an extension of DBSCAN that creates a reachability plot, which is a visual representation of the clustering structure
- Reachability plot displays the reachability distance for each point, which represents the minimum distance for a point to be density-reachable from a higher density point
- Clusters appear as valleys or dents in the reachability plot, with deeper valleys indicating denser clusters
- Provides a hierarchical clustering structure without explicitly producing a data set clustering
- Allows for extracting clusters at different density levels (e.g., by varying the Eps parameter) from the reachability plot