Reputation: 23
I'm trying to implement a DBSCAN clustering test application using the ELKI library. My dataset is 6-dimensional and contains around 100,000 objects.
I have tried to use ELKI's R*-tree optimization in my code, but benchmarking suggests the runtime still grows as O(n^2).
This is the code I use inside my application:
// Configure DBSCAN: epsilon, minPts, and Euclidean distance.
ListParameterization dbscanParams = new ListParameterization();
dbscanParams.addParameter(DBSCAN.Parameterizer.EPSILON_ID, eps);
dbscanParams.addParameter(DBSCAN.Parameterizer.MINPTS_ID, minPts);
dbscanParams.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, EuclideanDistanceFunction.class);
DBSCAN<DoubleVector, DoubleDistance> dbscan = ClassGenericsUtil.parameterizeOrAbort(DBSCAN.class, dbscanParams);

// Build a static database backed by the in-memory feature matrix,
// indexed with an R*-tree bulk-loaded via sort-tile-recursive.
ArrayAdapterDatabaseConnection arrayAdapterDatabaseConnection = new ArrayAdapterDatabaseConnection(featuresMatrix, featuresLabels);
ListParameterization dbparams = new ListParameterization();
dbparams.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
dbparams.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
dbparams.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, arrayAdapterDatabaseConnection);
dbparams.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, pageSize);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, dbparams);
db.initialize();

// Run the clustering.
Clustering<Model> result = dbscan.run(db);
Running the code above leads to these results:
| NUM_OBJECTS | TIME(ms) |
|-------------|------------|
| 4444 | 1508 |
| 8887 | 5547 |
| 17768 | 23401 |
| 35536 | 103733 |
| 71040 | 426494 |
| 142080 | 1801652 |
The time is measured with a simple System.currentTimeMillis() around the dbscan.run(db) call. Looking at the time column, the trend looks like n^2 rather than n log(n), and I can't figure out what I'm missing to use ELKI's DBSCAN with the R*-tree optimization.
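For what it's worth, the quadratic trend can be read directly off the table: since n roughly doubles between rows, the empirical exponent b in t ∝ n^b is log(t2/t1)/log(n2/n1) for each pair of consecutive rows. A small sketch (plain Java, independent of ELKI; ScalingCheck is just a throwaway name) using the numbers measured above:

```java
public class ScalingCheck {
    /** Exponent b in t ~ n^b estimated from two measurements. */
    static double exponent(long n1, long t1, long n2, long t2) {
        return Math.log((double) t2 / t1) / Math.log((double) n2 / n1);
    }

    public static void main(String[] args) {
        long[] n  = {4444, 8887, 17768, 35536, 71040, 142080};
        long[] ms = {1508, 5547, 23401, 103733, 426494, 1801652};
        for (int i = 1; i < n.length; i++) {
            // Exponent close to 2.0 means quadratic scaling on this interval.
            System.out.printf("%6d -> %6d : b = %.2f%n",
                    n[i - 1], n[i], exponent(n[i - 1], ms[i - 1], n[i], ms[i]));
        }
    }
}
```

Every interval comes out close to b = 2, so the measurements are indeed consistent with quadratic growth, not n log(n).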
Thanks for any help or suggestions.
Upvotes: 1
Views: 1061
Reputation: 8715
If you choose the query radius epsilon too large, every object will have O(n) neighbors. Then the runtime will be O(n^2) or worse, even with index support, because the answer size of each query is O(n).
If you choose epsilon such that on average 10% of the objects are within the epsilon radius, then your runtime will be at least O(n * 10% * n), which is O(n^2).
Which shows nicely how a theoretical runtime of O(n log n) may not give you a runtime of O(n log n) in practice. An R*-tree can answer radius or kNN queries in O(log n) on average - for small answer sets, where the answer set size can be neglected. A more precise analysis would likely yield a runtime of O(log n + |answer| log |answer|) (because we currently sort the answers by distance; we could remove this for some algorithms).
More often than not, an algorithm assumed to be O(n*n) will cost you O(n*n log n) runtime, because for every object you sort all others by distance. Fortunately, sorting is well optimized, so the extra log n does not matter much.
Upvotes: 1