Reputation: 463
Question: The best way to find out the Eps and MinPts parameters for DBSCAN algorithm?
Problem: The goal is to find the locations (clusters) based on coordinates (input data). The algorithm calculates the most visited areas and retrieves these clusters.
Approach:
I defined the epsilon (EPS) parameter as 1.5 km - converted to radians to be used by the DBSCAN algorithm: epsilon = 1.5 / 6371.0088
(ref to this 1.5 km: https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/).
If I define the MinPts to a low value (e.g. MinPts = 5, it will produce 2000 clusters), the DBSCAN will produce too many clusters and I want to limit the relevance/size of the clusters to an acceptable value. I use the haversine metric and ball tree algorithm to calculate great-circle distances between points.
Suggestions:
Data: I'm using 160k coordinates but the program should be capable to handle different data inputs.
Upvotes: 2
Views: 3172
Reputation: 13129
As you may know, setting MinPts
high will not only prevent small clusters from forming, but will also change the shape of larger clusters as its outskirts will be considered outliers.
Consider instead a third way to reduce the number of clusters; simply sort by descending size (number of coordinates) and limit that to 4 or 5. This way, you won't be shown all the small clusters if you're not interested in them, but you can instead treat all those points as noise.
You're essentially using DBSCAN for something it's not meant for, namely to find the n
largest clusters, but that's fine - you just need to "tweak the algorithm" to fit your use case.
Update
If you know the entire dataset and it will not change in the future, I would just tweak minPts
manually, based on your knowledge.
In scientific environments and with varying data sets, you consider the data as "generated from a stochastic process". However, that would mean that there is a chance - no matter how small - that there are minPts
dogs in a remote forest somewhere at the same time, or minPts - 1
dogs in Central Park, where it's normally overcrowded.
What I mean by that is that if you go down the scientific road, you need to find a balance between the deterministic value of minPts
and the probabilistic distribution of the points in your data set.
In my experience, it all comes down to whether or not you trust your knowledge, or would like to defer responsibility. In some government/scientific/large corporate positions, it's a safer choice to pin something on an algorithm than on gut feeling. In other situations, it's safe to use gut feeling.
Upvotes: 2