Chris Rigano
Chris Rigano

Reputation: 701

hierarchical clustering in MapReduce to implement a dendrogram of representative link communities

I want to use hierarchical clustering with a similarity between links to build a dendrogram where each leaf is a link from the original network and branches represent link communities using hierarchical clustering in MapReduce. Any references approaches algorithms would be appreciated.

Upvotes: 0

Views: 991

Answers (1)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77485

Hierarchical clustering will not work well with MapReduce ever.

First of all, MapReduce is designed to process one record at a time, whereas distance computations do need at least two records at a time. Hacking distance based algorithms into MapReduce usually is a bad idea for this very reason; and this is why MapReduce is pretty much yesterdays news now. Nobody focuses on MapReduce anymore, it is too limited for anything except preprocessing and filtering.

Furthermore, it will not help much. The reason why hierarchical clustering scales inherently bad is that it requires O(n^2) memory and O(n^3) runtime in the usual implementation. So any gains obtained by a distributed implementation will only allow you to process a marginally larger data set: if you add 8 times as many computers, you can only process a 2 times larger data set. This is clearly unsatisfactory for scalability. Plus, you will also need a lot more diskspace. Do you really expect to have O(n^2) disk space available?

Last but not least, if you really have big data, you will not be able to look at the dendrogram anyway.

If you have big data, use a different algorithm. Or sampling.

If your data isn't that large, just use a single-host implementation like ELKI, which is reasonably fast. Most likely, it will outperform a cluster-based implementation because it can avoid all the extra disk-IO you get with mapreduce.

Upvotes: 1

Related Questions