ekta

Reputation: 1620

Optimizing DBSCAN to run within computational limits

I am running the DBSCAN algorithm in Python on a dataset (modelled very similarly to http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html and loaded as a pandas DataFrame) that has a total of ~3 million data points across 31 days. I do the density clustering to find outliers on a per-day basis, so db = DBSCAN(eps=0.3, min_samples=10).fit(data) only has one day's worth of data points to run on in each pass. The minimum and maximum number of points I have on any single day are 15,809 and 182,416. I tried deleting the variables, but the process still gets killed at the DBSCAN clustering stage.
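
For reference, the per-day loop looks roughly like this (the CSV path and the day/x/y column names are placeholders for my actual layout):

    import pandas as pd
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    # Placeholder layout: a 'day' column plus two numeric feature columns.
    df = pd.read_csv("points.csv")

    labels_per_day = {}
    for day, group in df.groupby("day"):
        # Scale one day's features and cluster them.
        X = StandardScaler().fit_transform(group[["x", "y"]].values)
        db = DBSCAN(eps=0.3, min_samples=10).fit(X)
        labels_per_day[day] = db.labels_   # -1 marks noise/outliers
        del X, db                          # drop references before the next day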

  1. At O(n log n) this obviously blows up, no matter where I run it. I understand there is no way to pre-specify the number of "labels", or clusters - what else would be the best approach here?

  2. Also, from an optimization point of view, some of these data points will be exact duplicates (think of them as repeated cluster points) - can I use this information to pre-process the data before feeding it to DBSCAN? (See the sketch after this list for the kind of pre-pass I mean.)

  3. I read this thread on using "canopy preclustering" to compress the data, as in vector quantization, ahead of DBSCAN (note that this method is itself computationally expensive) - can I use something similar to pre-process my data? Or how about "parallel DBSCAN"? (A quantization sketch also follows below.)
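
To make point 2 concrete, this is the kind of pre-pass I have in mind: collapse exact duplicate rows and, if the installed scikit-learn's DBSCAN.fit accepts sample_weight, let each unique point count as many times as it occurred (x/y are again placeholder column names):

    from sklearn.cluster import DBSCAN

    # 'group' is one day's DataFrame, as in the loop above.
    dedup = group.groupby(["x", "y"]).size().reset_index(name="count")
    X_unique = dedup[["x", "y"]].values
    weights = dedup["count"].values

    # With per-sample weights, a point repeated k times still contributes
    # k towards min_samples, but DBSCAN only has to index the unique rows.
    db = DBSCAN(eps=0.3, min_samples=10).fit(X_unique, sample_weight=weights)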
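
And for point 3, instead of canopy preclustering I could imagine using something like MiniBatchKMeans as a cheap vector quantizer and then running DBSCAN on the weighted codebook; a sketch, with the codebook size chosen arbitrarily:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans, DBSCAN

    X = group[["x", "y"]].values                 # one day's raw points
    n_codes = 2000                               # arbitrary codebook size
    mbk = MiniBatchKMeans(n_clusters=n_codes, batch_size=10000,
                          random_state=0).fit(X)
    centers = mbk.cluster_centers_
    weights = np.bincount(mbk.labels_, minlength=n_codes)

    # Cluster a few thousand representatives instead of the raw points,
    # again relying on DBSCAN accepting sample_weight.
    db = DBSCAN(eps=0.3, min_samples=10).fit(centers, sample_weight=weights)

    # Each raw point takes the label of its nearest representative.
    point_labels = db.labels_[mbk.labels_]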

Upvotes: 1

Views: 1658

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77495

Have you considered doing one of the following:

  • partitioning: cluster one day (or less) at a time
  • sampling: break your data set randomly into 10 parts and process them individually (rough sketch below)
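
A rough sketch of the sampling variant, assuming X is one day's feature matrix as a NumPy array; note that each chunk only has ~1/10 of the density, so eps / min_samples would likely need revisiting:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.RandomState(0)
    idx = rng.permutation(len(X))          # shuffle the day's points

    for chunk in np.array_split(idx, 10):  # 10 random, roughly equal parts
        db = DBSCAN(eps=0.3, min_samples=10).fit(X[chunk])
        # db.labels_ holds this chunk's clusters; -1 marks outliers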

Upvotes: 1
