Reputation: 1620
I am running the DBSCAN algorithm in Python on a dataset (modelled very similarly to http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html and loaded as a pandas DataFrame) that has a total of ~3 million data points across 31 days. Further, I do the density clustering to find outliers on a per-day basis, so
db = DBSCAN(eps=0.3, min_samples=10).fit(data)
will only have a day's worth of data points to run on in each pass. The minimum/maximum number of points I have on any day is 15809 & 182416. I tried deleting the variables, but the process gets killed at the DBSCAN clustering stage.
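Here is roughly what the per-day loop looks like, as a sketch (the "day" column and the feature column names are placeholders for my actual data):

import pandas as pd
from sklearn.cluster import DBSCAN

def per_day_outliers(df, feature_cols, eps=0.3, min_samples=10):
    # Run DBSCAN separately on each day's points; label -1 marks the outliers.
    out = {}
    for day, group in df.groupby("day"):
        db = DBSCAN(eps=eps, min_samples=min_samples).fit(
            group[feature_cols].to_numpy())
        out[day] = pd.Series(db.labels_, index=group.index)
    return out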
At O(n log n), this obviously blows up no matter where I run it. I understand there is no way to pre-specify the number of "labels", or clusters - what is the best approach here?
Also, from an optimization point of view, some of these data points will be exact duplicates (think of them as cluster points that are repeated) - can I use this information to pre-process the data before feeding it to DBSCAN?
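For instance, I imagine I could collapse the exact duplicates first and run DBSCAN on the unique rows only, passing sample_weight so that each unique row still counts for all of its copies when min_samples is checked - a rough sketch:

import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_dedup(X, eps=0.3, min_samples=10):
    # X is assumed to be a 2-D NumPy array of features.
    # Collapse exact duplicate rows; counts holds how many copies each unique row has.
    uniq, inverse, counts = np.unique(
        X, axis=0, return_inverse=True, return_counts=True)
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(
        uniq, sample_weight=counts)
    # Map the labels back onto the original, duplicated points.
    return db.labels_[inverse]

Would that be a sensible way to exploit the repeated values?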
I read this thread on using "canopy preclustering" to compress the data, as in vector quantization, ahead of DBSCAN (note that this method is equally expensive computationally) - can I use something similar to pre-process my data? Or how about "parallel DBSCAN"?
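As a concrete version of that idea (not the canopy algorithm itself - I am just substituting MiniBatchKMeans as the cheap vector-quantization step, and the codebook size is an arbitrary guess):

import numpy as np
from sklearn.cluster import MiniBatchKMeans, DBSCAN

def dbscan_on_quantized(X, n_codebook=2000, eps=0.3, min_samples=10):
    # Quantize: map every point to one of n_codebook centroids.
    mbk = MiniBatchKMeans(n_clusters=n_codebook, random_state=0).fit(X)
    counts = np.bincount(mbk.labels_, minlength=n_codebook)
    # Cluster only the centroids, weighting each by the number of points it represents.
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(
        mbk.cluster_centers_, sample_weight=counts)
    # Propagate the centroid labels back to the original points.
    return db.labels_[mbk.labels_]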
Upvotes: 1
Views: 1658
Reputation: 77495
Have you considered doing:
Upvotes: 1