mnowotka

Reputation: 17258

Clustering huge data matrix in python?

I want to cluster 1.5 million chemical compounds. This means building a 1.5 million x 1.5 million distance matrix...

I think I can generate such a big table using PyTables, but once I have such a table, how will I cluster it?

I guess I can't just pass a PyTables object to one of the scikit-learn clustering methods...

Are there any Python-based frameworks that would take my huge table and do something useful (like clustering) with it? Perhaps in a distributed manner? A sketch of what I mean by generating the table is below.
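For concreteness, here is a minimal sketch of the kind of PyTables approach I have in mind: compute the distance matrix one block of rows at a time and store it in an on-disk array. The `features` array here is only a hypothetical stand-in for real compound descriptors.

```python
import numpy as np
import tables
from scipy.spatial.distance import cdist

# Hypothetical stand-in for real compound descriptors (1000 compounds, 64 features).
features = np.random.rand(1000, 64).astype(np.float32)

n = features.shape[0]
block = 100  # rows computed per pass, to bound RAM use

h5 = tables.open_file("distances.h5", mode="w")
dist = h5.create_carray(h5.root, "dist",
                        atom=tables.Float32Atom(),
                        shape=(n, n))  # chunked on-disk array

for start in range(0, n, block):
    stop = min(start + block, n)
    # Write one horizontal stripe of the distance matrix at a time,
    # so only `block` rows ever live in memory.
    dist[start:stop, :] = cdist(features[start:stop], features).astype(np.float32)

h5.close()
```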

Upvotes: 1

Views: 2704

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77495

Maybe you should look at algorithms that don't need a full distance matrix.

I know that it is popular to formulate algorithms as matrix operations, because tools such as R are rather fast at matrix operations (and slow at other things). But there is a whole ton of methods that don't require O(n^2) memory...
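As one illustration (my sketch, not the asker's setup): if each compound is represented as a fixed-length fingerprint/feature vector, something like scikit-learn's MiniBatchKMeans clusters the vectors directly in O(n) memory, and no n x n distance matrix is ever materialized. The data and parameter values here are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for real fingerprints: 100k compounds, 64 features
# (at this size the raw vectors fit comfortably in RAM, unlike the matrix).
X = np.random.rand(100_000, 64).astype(np.float32)

km = MiniBatchKMeans(n_clusters=1000, batch_size=10_000, random_state=0)
labels = km.fit_predict(X)  # processes the data in small batches, O(n) memory
```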

Upvotes: 4

jacek2v

Reputation: 590

I think the main problem is memory: 1.5 million x 1.5 million elements x 10 B (per element) is about 22.5 TB, i.e. more than 20 TB. You can use a big-data store such as PyTables, or Hadoop (http://en.wikipedia.org/wiki/Apache_Hadoop) with the MapReduce algorithm.
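A quick back-of-the-envelope check of that estimate, assuming the 10-byte element size stated above:

```python
n = 1_500_000        # number of compounds
element_size = 10    # assumed bytes per matrix element
total_bytes = n * n * element_size
print(total_bytes / 10**12)  # 22.5 -> roughly 22.5 TB
```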

Here are some guides: http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html

Or use the Google App Engine Datastore with MapReduce (https://developers.google.com/appengine/docs/python/dataprocessing/) - but it isn't a production version yet.

Upvotes: 1
