ekta

Reputation: 1620

Optimizing DBSCAN to run within computational limits

I am running the DBSCAN algorithm in Python on a dataset (modelled very similarly to http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html and loaded as a pandas DataFrame) that has a total of ~3 million data points across 31 days. I do the density clustering to find outliers on a per-day basis, so db = DBSCAN(eps=0.3, min_samples=10).fit(data) only has one day's worth of data points to run on in each pass. The minimum and maximum number of points I have on any single day are 15,809 and 182,416. I tried deleting the variables, but the process still gets killed at the DBSCAN clustering stage.
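
For reference, the per-day loop looks roughly like this (the CSV path and the day/x/y column names are placeholders for my actual layout):

    import pandas as pd
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    # Placeholder layout: a 'day' column plus two numeric feature columns.
    df = pd.read_csv("points.csv")

    labels_per_day = {}
    for day, group in df.groupby("day"):
        # Scale one day's features and cluster them.
        X = StandardScaler().fit_transform(group[["x", "y"]].values)
        db = DBSCAN(eps=0.3, min_samples=10).fit(X)
        labels_per_day[day] = db.labels_   # -1 marks noise/outliers
        del X, db                          # drop references before the next day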

  1. At O(n log n) this obviously blows up, no matter where I run it. I understand there is no way to pre-specify the number of "labels", or clusters - what else would be the best approach here?

  2. Also, from an optimization point of view, some of these data points will be exact duplicates (think of them as repeated cluster points) - can I use this information to pre-process the data before feeding it to DBSCAN? (See the sketch after this list for the kind of pre-pass I mean.)

  3. I read this thread on using "canopy preclustering" to compress the data, as in vector quantization, ahead of DBSCAN (note that this method is itself computationally expensive) - can I use something similar to pre-process my data? Or how about "parallel DBSCAN"? (A quantization sketch also follows below.)
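
To make point 2 concrete, this is the kind of pre-pass I have in mind: collapse exact duplicate rows and, if the installed scikit-learn's DBSCAN.fit accepts sample_weight, let each unique point count as many times as it occurred (x/y are again placeholder column names):

    from sklearn.cluster import DBSCAN

    # 'group' is one day's DataFrame, as in the loop above.
    dedup = group.groupby(["x", "y"]).size().reset_index(name="count")
    X_unique = dedup[["x", "y"]].values
    weights = dedup["count"].values

    # With per-sample weights, a point repeated k times still contributes
    # k towards min_samples, but DBSCAN only has to index the unique rows.
    db = DBSCAN(eps=0.3, min_samples=10).fit(X_unique, sample_weight=weights)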
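
And for point 3, instead of canopy preclustering I could imagine using something like MiniBatchKMeans as a cheap vector quantizer and then running DBSCAN on the weighted codebook; a sketch, with the codebook size chosen arbitrarily:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans, DBSCAN

    X = group[["x", "y"]].values                 # one day's raw points
    n_codes = 2000                               # arbitrary codebook size
    mbk = MiniBatchKMeans(n_clusters=n_codes, batch_size=10000,
                          random_state=0).fit(X)
    centers = mbk.cluster_centers_
    weights = np.bincount(mbk.labels_, minlength=n_codes)

    # Cluster a few thousand representatives instead of the raw points,
    # again relying on DBSCAN accepting sample_weight.
    db = DBSCAN(eps=0.3, min_samples=10).fit(centers, sample_weight=weights)

    # Each raw point takes the label of its nearest representative.
    point_labels = db.labels_[mbk.labels_]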

Upvotes: 1

Views: 1658

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77495

Have you considered doing one of the following:

  • partitioning: cluster one day (or less) at a time
  • sampling: break your data set randomly into 10 parts and process them individually (rough sketch below)
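
A rough sketch of the sampling variant, assuming X is one day's feature matrix as a NumPy array; note that each chunk only has ~1/10 of the density, so eps / min_samples would likely need revisiting:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.RandomState(0)
    idx = rng.permutation(len(X))          # shuffle the day's points

    for chunk in np.array_split(idx, 10):  # 10 random, roughly equal parts
        db = DBSCAN(eps=0.3, min_samples=10).fit(X[chunk])
        # db.labels_ holds this chunk's clusters; -1 marks outliers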

Upvotes: 1
