Paul

Reputation: 1355

clustering large data with ELKI

I'm using ELKI's AnderbergHierarchicalClustering on a data set with over 150000 observations. Each observation has three variables: lat, lng and price, all of them doubles.

I've run into the following problem: the memory requirements. For 20000 observations alone I need roughly 4.8 GB of RAM, so clustering the full set at once is out of reach. To work around this, I decided to split the data set into overlapping subsets of 20000 observations each.

I don't know what the best approach would be for splitting the data so that the clustering result on the subsets comes as close as possible to the result of clustering the whole set.
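One possible way to set this up (a sketch of my own, not an ELKI feature: the window size, overlap, and the idea of sorting by one coordinate first are all assumptions) is to sort the observations by, say, latitude and then take fixed-size windows that overlap, so points near a chunk boundary appear in two chunks and their clusters can be matched afterwards:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split n (pre-sorted) observations into windows
// of `size` that overlap by `overlap` observations.
public class OverlappingSplit {
    public static List<int[]> windows(int n, int size, int overlap) {
        List<int[]> result = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < n; start += step) {
            int end = Math.min(start + size, n);
            result.add(new int[] { start, end }); // [start, end) index range
            if (end == n) break;
        }
        return result;
    }

    public static void main(String[] args) {
        // 150000 observations, windows of 20000 overlapping by 5000
        for (int[] w : windows(150000, 20000, 5000)) {
            System.out.println(w[0] + ".." + w[1]);
        }
    }
}
```

With these parameters you get 10 windows of at most 20000 observations each, which keeps every chunk within the memory budget described above.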

Upvotes: 0

Views: 190

Answers (1)

Erich Schubert

Reputation: 8715

If you use single-linkage, you can use SLINK, which only needs O(n) memory.

It will still need O(n^2) time.

Hierarchical clustering is not very scalable.

CLINK can do complete linkage with O(n) memory, but the result quality is usually not very good (usually worse than Anderberg with complete linkage).

All the other algorithms we have are O(n^2) in memory, unfortunately.

Thus at around 65535 instances you will hit a wall with Java: the triangular distance matrix no longer fits into a single Java array, which is limited to Integer.MAX_VALUE elements.
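The arithmetic behind that wall can be checked directly. A condensed matrix has n·(n−1)/2 entries, and a single Java array can hold at most Integer.MAX_VALUE (2³¹ − 1) of them, which gives a cutoff right around the figure quoted above:

```java
// A condensed (triangular) distance matrix has n*(n-1)/2 entries;
// one Java array can hold at most Integer.MAX_VALUE of them.
public class ArrayLimit {
    public static boolean fitsInOneArray(long n) {
        return n * (n - 1) / 2 <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(fitsInOneArray(65536)); // true:  2_147_450_880 entries
        System.out.println(fitsInOneArray(65537)); // false: 2_147_516_416 entries
    }
}
```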

I have one algorithm on my todo list that, if I'm not mistaken, should be able to run in O(n log n) with index support. But I have not gotten around to studying it in detail yet; it's also not clear which linkage strategies it can support besides single-linkage.

If you have a lot of duplicates, make sure to merge them first!
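One way to do that merging as a preprocessing step (a hypothetical sketch of my own, not something ELKI does for you; the key encoding and the class name are assumptions) is to collapse exact duplicates and keep their multiplicity, so the O(n²) clustering only ever sees unique points:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical preprocessing: collapse exact (lat, lng, price)
// duplicates and remember how often each point occurred, so the
// clustering runs on unique points only and n shrinks accordingly.
public class Dedup {
    public static Map<String, Integer> merge(double[][] points) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (double[] p : points) {
            String key = p[0] + "," + p[1] + "," + p[2]; // lat,lng,price
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

After clustering the unique points, the stored multiplicities let you recover the true size of each resulting cluster.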

Upvotes: 1
