
Visual Guide to DBSCAN Clustering

12 Mar 2017

Clustering has become an everyday process for grouping together observations based on similar factors. This is particularly true when working with spatial data. For some of my ongoing research into applying spatial statistics to fluorescence microscopy, I’ve been applying DBSCAN to binary images of fluorescence-tagged chromosomes to localize the chromosomes. The Scikit Learn Python library provides a blisteringly fast DBSCAN implementation that can cluster 78 million observations in 6 seconds.
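For reference, here is a minimal sketch of how that implementation is typically invoked. The synthetic data and the eps and min_samples values are illustrative placeholders, not the settings from my research:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for thresholded pixel coordinates from a binary image.
points, _ = make_blobs(n_samples=10_000, centers=3, cluster_std=0.4, random_state=0)

# eps and min_samples are illustrative values, not tuned for real microscopy data.
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(points)
print(np.unique(labels))  # -1 marks noise; 0, 1, ... are cluster ids
```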

Figure 1: Real time DBSCAN clustering of two sets of normally distributed points in a field of noise. A JS implementation of DBSCAN classified sets of two-dimensional coordinates as being either noise or one of two (or more) clusters. As a general warning, the data used for this example are randomly generated on page load, so it's possible to identify more than two clusters in this dataset due to the non-deterministic nature of both the data and DBSCAN.

As I continued working with the algorithm, I started to think that it would be interesting to see the process unfold step by step for a set of data. To that end, I’ve created an annotated step-by-step guide to how DBSCAN clusters data.

For this guide, the data points are pseudo-randomly generated using three distributions, denoted here as $X_1$, $X_2$, and $X_3$. These random variables are all two-dimensional: two normal distributions and a single uniform distribution. We can describe these distributions as follows:

\(X_1 \sim (\mathcal{N}(0, 0.075), \mathcal{N}(0, 0.075))\)

\(X_2 \sim (\mathcal{N}(0.375, 0.075), \mathcal{N}(0.5, 0.075))\)

\(X_3 \sim (\mathcal{U}(-0.25, 0.55), \mathcal{U}(-0.2, 0.65))\)

For the randomly generated data set you are seeing, we’ve sampled 70 observations from $X_1$, 25 observations from $X_2$, and 10 observations from $X_3$. With the math bit out of the way, let’s get to the good stuff.
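As a rough sketch (assuming NumPy, and interpreting the second parameter of each $\mathcal{N}$ as a standard deviation), such a sample could be generated like this:

```python
import numpy as np

rng = np.random.default_rng()  # no fixed seed: the data differ on every "page load"

# Interpreting the second parameter of each N(mu, 0.075) as a standard deviation.
x1 = rng.normal(loc=(0.0, 0.0), scale=0.075, size=(70, 2))             # X1: 70 observations
x2 = rng.normal(loc=(0.375, 0.5), scale=0.075, size=(25, 2))           # X2: 25 observations
x3 = rng.uniform(low=(-0.25, -0.2), high=(0.55, 0.65), size=(10, 2))   # X3: 10 noise points

data = np.vstack([x1, x2, x3])
```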

Before starting the clustering process, DBSCAN requires two parameters: $\epsilon$, the maximum distance between two points for them to be considered neighbors, and minPts, the minimum number of neighbors a point must have within distance $\epsilon$ to be considered a core point.

The DBSCAN process starts by selecting a single observation from the data set. If it has not been visited before (at this point, it hasn’t), we count the number of its neighbors within the distance $\epsilon$. If the number of neighbors is at least minPts, we classify the point as a core point and use it to expand a cluster. If the number of neighbors is less than minPts, we consider the point noise.
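A minimal sketch of that core-point check, in plain NumPy (not the implementation behind the figures):

```python
import numpy as np

def is_core_point(data, idx, eps, min_pts):
    """Return True if point `idx` has at least `min_pts` neighbors within `eps`."""
    distances = np.linalg.norm(data - data[idx], axis=1)
    # Count neighbors within eps; following the text, the point itself is not counted.
    neighbors = np.flatnonzero(distances <= eps)
    return neighbors[neighbors != idx].size >= min_pts
```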

Figure 2: Initial exploration of a point. This figure identifies an initial point. We start by randomly selecting a point from the dataset. We verify that we've not already explored this point, and if we have not, we determine if the point is a core point.

For each previously unvisited neighbor, we assign it to the cluster and then use it as a point for expanding the cluster further. During this expansion phase, we start by finding all the neighbor’s neighbors (second-degree neighbors) within a distance $\epsilon$. If the number of second-degree neighbors is less than minPts, we stop our expansion and go back to iterating over the original first-degree neighbors. If the number of second-degree neighbors is at least minPts, then the expanding point is also a core point, which we will use for continued expansion.

Figure 3: Cluster expansion around a core point. Since our initial exploratory point has more neighbors than the minimum number of points per cluster, we consider it a core point. We then find all neighbors within a distance $\epsilon$ (the colored ellipse) to add to the cluster, and to use as launching points for continued cluster expansion.

When a new core point is found, we add all the core point’s neighbors (second-degree neighbors) to the list of neighbors to explore, and we continue iterating through the first-degree neighbors. We then continue this process of iterating through neighbors and expanding around core points until we have exhausted all unvisited neighbors. This incremental expansion of the cluster allows us to cluster spatial data without worrying about the structure of each cluster. We could, hypothetically, have two spirals intertwined and still be able to classify them as separate clusters.
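Putting the exploration and expansion steps together, a compact and unoptimized sketch of the whole procedure might look like the following; the function and variable names are my own, not taken from any particular library:

```python
import numpy as np

NOISE = -1
UNVISITED = None

def region_query(data, idx, eps):
    """Indices of points within `eps` of point `idx` (excluding itself)."""
    distances = np.linalg.norm(data - data[idx], axis=1)
    hits = np.flatnonzero(distances <= eps)
    return hits[hits != idx].tolist()

def dbscan(data, eps, min_pts):
    """Label each row of `data` with a cluster id, or NOISE (-1)."""
    labels = [UNVISITED] * len(data)
    cluster_id = 0
    for i in range(len(data)):
        if labels[i] is not UNVISITED:
            continue                        # already visited
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE               # not a core point: mark as noise for now
            continue
        labels[i] = cluster_id              # new core point: start a cluster
        queue = list(neighbors)
        while queue:                        # expand through first-, second-, ... degree neighbors
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # border point previously marked as noise
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(data, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also a core point: keep expanding
        cluster_id += 1
    return labels
```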

Since we explore the points in a variable order, we must consider this clustering process non-deterministic. Due to this variability, performing DBSCAN on a dataset may produce different clustering results from one execution to the next. We can mitigate this issue by performing multiple clustering runs, and by tuning the clustering parameters minPts and $\epsilon$.

Figure 4: Identification of new core points and continued cluster expansion around a core point. As we continue exploring our core point's neighbors, we check each neighbor to see if it qualifies as a core point. If it does, then we identify all of that point's neighbors and explore them too.

Eventually, we will have explored all first-degree neighbors. When this occurs, we can pick a new unexplored point and start the process again.

Figure 5: Core point and neighbor exhaustion. When we run out of neighbors and core points to expand out from, we have completed the classification of the cluster. We then pick a new point and start the process again.

By repeating these five steps until we have explored all points, we can assign every point either to a specific cluster or to noise.

Figure 6: Repeat. We then repeat this core point identification and cluster expansion process until we have visited all points. In our case, we are likely to see two clusters (orange and green), with some points having been classified as noise (blue). If your random seed was lucky when the page loaded, you may see more than two colorful clusters identified by DBSCAN.

One important characteristic to note about DBSCAN is that this clustering algorithm does not assume anything about the number of clusters or their constituents. The only property DBSCAN accounts for is, in essence, density. This contrasts with algorithms like CLARANS and $k$-means, which require us, as data scientists, to assume in advance that a data set has $k$ clusters to find. While this may limit the usefulness of DBSCAN in situations like market segmentation and binning, DBSCAN really shines when our spatial data are messy.
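In scikit-learn terms, the difference shows up in the constructor arguments alone; a small illustrative comparison (parameter values are placeholders):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal((0.0, 0.0), 0.075, size=(70, 2)),
    rng.normal((0.375, 0.5), 0.075, size=(25, 2)),
])

# k-means must be told how many clusters to find...
kmeans_labels = KMeans(n_clusters=2, n_init=10).fit_predict(data)

# ...while DBSCAN only needs a density scale (eps) and a density threshold (min_samples).
dbscan_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(data)
```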

[animated "but why?" gif]

Figure 7: Why DBSCAN? I don't really have figure text for this one. I just thought it would be funny to have a figure caption for this gif. ¯\_(ツ)_/¯

Unfortunately for us, messy data are everywhere. Real-world data are full of variability, and complex structures are typically lost when binning with a partition-based clustering algorithm.

In my own research, there is no guarantee that all the localized chromosomes will be perfectly contained within specific regions of the cell. Instead, they group all along the outside of the cell, with thousands of chromosomes escaping due to damage caused to the cell membrane during the tagging process.

There is one major drawback to DBSCAN, however, that makes it an imperfect universal clustering algorithm for spatial applications: density. DBSCAN relies on density so heavily that it is unable to cluster data with radically different densities.

Figure 8: DBSCAN does not perform well for data with inconsistent density. This figure has the same number of points as the original, but in this case the distribution of $X_2$ has been modified to have a variance of 0.12, an increase of 60%. Due to this decrease in density, DBSCAN is more likely to mis-categorize points from $X_2$ as noise.

To cluster spatial data while accounting for varying density, you may need DBSCAN’s sibling: OPTICS.
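Scikit Learn also ships an OPTICS implementation with a similar interface (in releases that include sklearn.cluster.OPTICS); a minimal sketch with illustrative parameters:

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two blobs with very different spreads, to mimic clusters of uneven density.
points, _ = make_blobs(n_samples=500, centers=[(0, 0), (3, 3)],
                       cluster_std=[0.2, 1.0], random_state=0)

# OPTICS orders points by reachability instead of committing to a single eps.
labels = OPTICS(min_samples=10).fit_predict(points)
```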