Visual Guide to DBSCAN Clustering
Clustering has become an everyday tool for grouping observations by similarity. This is particularly true when working with spatial data. For some of my ongoing research into applying spatial statistics to fluorescence microscopy, I’ve been applying DBSCAN to binary images of fluorescence-tagged chromosomes in order to localize them. The scikit-learn Python library provides a blisteringly fast DBSCAN implementation that can cluster 78 million observations in 6 seconds.
As I continued working with the algorithm, I thought it would be interesting to watch the process unfold step by step on a set of data. To that end, I’ve created an annotated step-by-step guide to how DBSCAN clusters data.
For this guide, the data points are pseudo-randomly generated from three distributions, denoted $X_1$, $X_2$, and $X_3$. All three are two-dimensional: $X_1$ and $X_2$ are built from normal distributions, while $X_3$ is built from uniform distributions. We can describe them as follows:
\(X_1 \sim (\mathcal{N}(0, 0.075),\; \mathcal{N}(0, 0.075))\)

\(X_2 \sim (\mathcal{N}(0.375, 0.075),\; \mathcal{N}(0.5, 0.075))\)

\(X_3 \sim (\mathcal{U}(-0.25, 0.55),\; \mathcal{U}(-0.2, 0.65))\)
For the data set used throughout this guide, we’ve sampled 70 observations from $X_1$, 25 from $X_2$, and 10 from $X_3$. With the math out of the way, let’s get to the good stuff.
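If you’d like to follow along, here is a minimal sketch of how such a sample could be generated with NumPy. The seed is an arbitrary choice of mine, and I’m reading the second parameter of each normal distribution as its standard deviation:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, just for reproducibility

# X1: 70 observations from a 2-D normal centered at the origin
x1 = rng.normal(loc=[0.0, 0.0], scale=0.075, size=(70, 2))

# X2: 25 observations from a 2-D normal centered at (0.375, 0.5)
x2 = rng.normal(loc=[0.375, 0.5], scale=0.075, size=(25, 2))

# X3: 10 observations of uniform background noise over the bounding box
x3 = rng.uniform(low=[-0.25, -0.2], high=[0.55, 0.65], size=(10, 2))

points = np.vstack([x1, x2, x3])  # 105 observations in total
```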
Before starting the clustering process, DBSCAN requires two parameters: $\epsilon$, the maximum distance between two points for them to count as neighbors, and minPts, the minimum number of neighbors a point must have within distance $\epsilon$ to be considered a core point.
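In scikit-learn, these two parameters map onto the `eps` and `min_samples` arguments of `sklearn.cluster.DBSCAN`. The values below are illustrative rather than tuned, and `points` reuses the sample generated above:

```python
from sklearn.cluster import DBSCAN

# eps is the neighborhood radius; min_samples plays the role of minPts.
# Note that scikit-learn counts the point itself toward min_samples.
db = DBSCAN(eps=0.1, min_samples=5).fit(points)

labels = db.labels_  # one cluster index per observation; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```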
The DBSCAN process starts by selecting a single observation from the data set. If it hasn’t been visited before (and at this point it hasn’t), we count the number of neighbors within distance $\epsilon$. If the number of neighbors is at least minPts, we classify the point as a core point and use it to expand a cluster. If it is less than minPts, we provisionally consider the point noise (a point marked as noise may later be reclaimed as a border point of a cluster).
For each previously unvisited neighbor, we assign it to the cluster and then use it to expand the cluster further. During this expansion phase, we start by finding all of that neighbor’s own neighbors (second-degree neighbors) within distance $\epsilon$. If the number of second-degree neighbors is less than minPts, we stop this branch of the expansion and go back to iterating over the original first-degree neighbors. If the number of second-degree neighbors is at least minPts, then the expanding point is itself a core point, which we will use for continued expansion.
When a new core point is found, we add all of its neighbors (the second-degree neighbors) to the list of neighbors to explore, and we continue iterating through the first-degree neighbors. We repeat this process of iterating through neighbors and expanding around core points until we have exhausted all unvisited neighbors. This incremental expansion lets us cluster spatial data without worrying about the shape of each cluster. We could, hypothetically, have two intertwined spirals and still classify them as separate clusters.
Since we explore the points in an arbitrary order, the clustering process is not fully deterministic: a border point within distance $\epsilon$ of two different clusters is assigned to whichever cluster reaches it first. Running DBSCAN on the same data set can therefore produce slightly different results from one execution to the next. We can mitigate this issue by performing multiple clustering runs and by tuning the clustering parameters minPts and $\epsilon$.
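One way to see this for yourself (a sketch, reusing the `points` array from above): shuffle the input order and cluster again. Cluster indices may come out permuted, but only border points can actually change cluster membership.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Shuffle the input order and cluster again; the permutation seed is arbitrary.
perm = np.random.default_rng(0).permutation(len(points))
labels_original = DBSCAN(eps=0.1, min_samples=5).fit(points).labels_
labels_shuffled = DBSCAN(eps=0.1, min_samples=5).fit(points[perm]).labels_

# Map the shuffled labels back to the original point order. Core points and
# noise are stable; only border points within eps of two clusters can flip.
labels_restored = np.empty_like(labels_shuffled)
labels_restored[perm] = labels_shuffled
```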
Eventually, we will have explored all first-degree neighbors. When this occurs, the cluster is complete, and we can pick a new unexplored point and start the process again.
By repeating these steps until we have explored all points, we are able to sort every point into a specific cluster or mark it as noise, as the sketch below ties together.
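Here is a minimal from-scratch sketch of the procedure in plain Python with NumPy. The function and variable names are my own, and a serious implementation would use a spatial index (as scikit-learn does) instead of brute-force distance computations:

```python
import numpy as np

NOISE, UNVISITED = -1, -2

def region_query(points, idx, eps):
    """Return indices of every point within distance eps of points[idx]."""
    distances = np.linalg.norm(points - points[idx], axis=1)
    return list(np.where(distances <= eps)[0])

def dbscan(points, eps, min_pts):
    labels = np.full(len(points), UNVISITED)
    cluster_id = 0

    for i in range(len(points)):
        if labels[i] != UNVISITED:            # pick an unvisited point
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # too sparse: provisionally noise
            labels[i] = NOISE
            continue

        labels[i] = cluster_id                # core point: seed a new cluster
        j = 0
        while j < len(neighbors):             # expand through first-degree
            q = neighbors[j]                  # neighbors (and beyond)
            if labels[q] == NOISE:            # reclaim a border point
                labels[q] = cluster_id
            elif labels[q] == UNVISITED:
                labels[q] = cluster_id
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:   # q is also a core point:
                    neighbors.extend(q_neighbors)  # queue its neighbors too
            j += 1
        cluster_id += 1

    return labels
```

Calling `dbscan(points, eps=0.1, min_pts=5)` on the sample from earlier should recover the two dense clusters and flag most of the uniform background as noise.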
One important characteristic of DBSCAN is that it assumes nothing about the number of clusters or their constituents. The only variable DBSCAN accounts for is, in essence, density. This contrasts with algorithms like CLARANS and $k$-means, where we, as data scientists, must assert up front that the data set contains $k$ clusters to find. While this may limit DBSCAN’s usefulness in situations like market segmentation and binning, it really shines when our spatial data are messy.
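The difference shows up directly in the scikit-learn API: $k$-means must be told how many clusters to look for, while DBSCAN discovers the count on its own (parameter values here are illustrative):

```python
from sklearn.cluster import DBSCAN, KMeans

# k-means forces us to assert the number of clusters up front...
kmeans_labels = KMeans(n_clusters=2, n_init=10).fit(points).labels_

# ...whereas DBSCAN infers however many dense regions exist.
dbscan_labels = DBSCAN(eps=0.1, min_samples=5).fit(points).labels_
```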
Unfortunately for us, messy data are everywhere. Real-world data are full of variability, and their complex structures are typically lost when binned by a partition-based clustering algorithm.
In my own research, there is no guarantee that the localized chromosomes will be neatly contained in specific regions of the cell. Instead, they group all along the outside of the cell, with thousands of chromosomes escaping through damage caused to the cell membrane during the tagging process.
There is one major drawback to DBSCAN, however, that makes it an imperfect universal clustering algorithm for spatial applications: density. DBSCAN relies on density so heavily that it is unable to cluster data containing regions of radically different densities.
To cluster spatial data while accounting for varying density, you may need DBSCAN’s sibling: OPTICS.
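scikit-learn ships OPTICS alongside DBSCAN; a minimal sketch (the `min_samples` value is illustrative):

```python
from sklearn.cluster import OPTICS

# OPTICS orders points by reachability distance rather than fixing a single
# eps, which lets it extract clusters of differing densities.
optics_labels = OPTICS(min_samples=5).fit(points).labels_
```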