Say you are studying a population of dogs, made up of majority pure breed dogs. You want to separate the population into dog breeds, but you don’t know how many different breeds exist in the population. You also know that there are some mutts in the population which have weird features. These represent noisy data in your population.
You take several measurements of the dogs, including height, fluffiness, and muzzle roundness. Assume that fluffiness and muzzle roundness are ratings between 1 and 100. You can define a three dimensional space made up of your three features of height, fluffiness and muzzle roundness, marking each dog as a point in that space, and go looking through that space.
Density-Based Spatial Clustering of Applications with Noise, a.k.a DBSCAN, is a clustering method applicable to numerical data that, as the name suggests, can handle noise in the data.
There are two parameters you need to choose – a search radius (eps) and a minimum number of data points k.
Then, you randomly choose a dog to be the first member of a cluster. You look at a circle of radius eps around the dog, to see if there are a minimum of k dogs within that radius. In our case of a three dimensional space, you look at the sphere of radius eps. This can also be done in higher dimensions mathematically.
If there are k or more dogs in radius eps, then you mark them as all belonging to your new cluster, and take each point in and repeat until you can’t go any further. If not, then that dog is marked as noise, and is likely a non-pure bred dog.
You then start again, randomly choosing a dog to be the first member, but this dog cannot belong to your previous cluster(s). Repeat until all dogs have been assigned.
Note – you can use the knowledge of an expert in dogs to advise you what the best values of k and eps are, otherwise you can use trial and error.
A more detailed description of steps follows below.
User Defined Parameters
- Minimum data points (MinPts / k)
- Radius of the neighbourhood (eps)
Domain knowledge is useful in selecting these parameters.
Steps
- Choose your user defined parameters (k, eps)
- Define N as a candidate set, which we will add items into for inspection, to choose whether we add them to a cluster C.
- Consider all data samples as initially marked unvisited
- Choose a data sample randomly from the unvisited data samples, and mark as visited
- For this sample:
- Check within its radius eps whether it contains at least k data samples. If not, mark as a noise point, and start again at 4.
- Otherwise, add it to the cluster C
- For each data sample found within its radius eps:
- If it doesn’t belong to any other cluster:
- add it to cluster C
- mark it as visited
- Check within the radius eps of this sample whether it contains at least k data samples. If so, recursively apply 5.3 to this point
- If it doesn’t belong to any other cluster:
- At this point cluster C is obtained. Return to 4 to start looking for a new cluster, until all points are marked visited