Reputation: 23
I'm trying to implement a DBSCAN clustering test application using the ELKI library. My dataset is 6-dimensional and contains around 100,000 objects.
I have tried to use ELKI's R*-tree optimization in my code, but benchmarking suggests the runtime still grows as O(n^2).
This is the code I use inside my application:
// Configure DBSCAN: epsilon, minPts, and Euclidean distance.
ListParameterization dbscanParams = new ListParameterization();
dbscanParams.addParameter(DBSCAN.Parameterizer.EPSILON_ID, eps);
dbscanParams.addParameter(DBSCAN.Parameterizer.MINPTS_ID, minPts);
dbscanParams.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, EuclideanDistanceFunction.class);
DBSCAN<DoubleVector, DoubleDistance> dbscan = ClassGenericsUtil.parameterizeOrAbort(DBSCAN.class, dbscanParams);

// Build a static database backed by the in-memory feature matrix,
// indexed with an R*-tree bulk-loaded via sort-tile-recursive.
ArrayAdapterDatabaseConnection arrayAdapterDatabaseConnection = new ArrayAdapterDatabaseConnection(featuresMatrix, featuresLabels);
ListParameterization dbparams = new ListParameterization();
dbparams.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
dbparams.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
dbparams.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, arrayAdapterDatabaseConnection);
dbparams.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, pageSize);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, dbparams);
db.initialize();

// Run the clustering.
Clustering<Model> result = dbscan.run(db);
Running the code above leads to these results:
| NUM_OBJECTS | TIME(ms) |
|-------------|------------|
| 4444 | 1508 |
| 8887 | 5547 |
| 17768 | 23401 |
| 35536 | 103733 |
| 71040 | 426494 |
| 142080 | 1801652 |
The time is measured with a simple System.currentTimeMillis() around the dbscan.run(db) call. Looking at the time column, the trend looks like n^2 rather than n log(n), and I can't figure out what I'm missing to use ELKI's DBSCAN with the R*-tree optimization.
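For what it's worth, the quadratic trend can be read directly off the table: since n roughly doubles between rows, the empirical exponent b in t ∝ n^b is log(t2/t1)/log(n2/n1) for each pair of consecutive rows. A small sketch (plain Java, independent of ELKI; ScalingCheck is just a throwaway name) using the numbers measured above:

```java
public class ScalingCheck {
    /** Exponent b in t ~ n^b estimated from two measurements. */
    static double exponent(long n1, long t1, long n2, long t2) {
        return Math.log((double) t2 / t1) / Math.log((double) n2 / n1);
    }

    public static void main(String[] args) {
        long[] n  = {4444, 8887, 17768, 35536, 71040, 142080};
        long[] ms = {1508, 5547, 23401, 103733, 426494, 1801652};
        for (int i = 1; i < n.length; i++) {
            // Exponent close to 2.0 means quadratic scaling on this interval.
            System.out.printf("%6d -> %6d : b = %.2f%n",
                    n[i - 1], n[i], exponent(n[i - 1], ms[i - 1], n[i], ms[i]));
        }
    }
}
```

Every interval comes out close to b = 2, so the measurements are indeed consistent with quadratic growth, not n log(n).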
Thanks for any help or suggestions.
Upvotes: 1
Views: 1061
Reputation: 8715
If you choose the query radius epsilon too large, every object will have O(n) neighbors. Then the runtime will be O(n^2) or worse, even with index support, because the answer size of each query is O(n).
If you choose epsilon such that on average 10% of the objects are within the epsilon radius, then your runtime will be at least O(n * 10% * n), which is O(n^2).
Which shows nicely how a theoretical runtime of O(n log n) may not give you a runtime of O(n log n) in practice. An R*-tree can answer radius or kNN queries in O(log n) on average - for small answer sets, where the answer set size can be neglected. A more precise analysis would likely yield a runtime of O(log n + |answer| log |answer|) (because we currently sort the answers by distance; we could remove this for some algorithms).
More often than not, an algorithm assumed to be O(n*n) will cost you O(n*n log n) runtime, because for every object you sort all others by distance. Fortunately, sorting is well optimized, so the extra log n does not matter much.
Upvotes: 1