Reputation: 1355
I'm using ELKI's AnderbergHierarchicalClustering
for my data set with over 150000
observations and for each observation, I'm using three variables: lat
, lng
andprice
and all of them are double
.
I've got following problems:
right shift
for the Agnes triangle
- (size * (size - 1)) >>> 1
- this comes to large RAM
needsIn order to solve this problem, I decided to split the data set into overlapping subsets of 20000 obs
.
For 20000 obs
I shall need ~4.8GB RAM
.
I don't know what would be the best approach for splitting the data in such a way, that the cluster result applied on the subsets will be as close as possible to the result of clustering the whole set.
Upvotes: 0
Views: 190
Reputation: 8715
If you use single-linkage
, you can use SLINK
which only needs O(n)
memory.
It will still need O(n^2)
time.
Hierarchical clustering
is not very scalable.
CLINK
can do complete linkage
with O(n)
memory, but the result quality is usually not very good (usually worse than Anderberg
with complete linkage
).
All the other algorithms we have are O(n^2)
in memory, unfortunately.
Thus at around 65535 instances you will hit a wall with Java
.
I have one algorithm on my todo list that should be able to run in O(n log n)
if I'm not mistaken with index support
. But I did not get around to study it in detail yet - it's not clear which linkage strategies
it can support besides single-linkage
.
If you have a lot of duplicates, make sure to merge them first!
Upvotes: 1