RamBracha

Reputation: 401

kmeans with big data

I want to cluster a big data matrix (5 million × 512) with k-means into 5000 centers. I'm using Revolution R in order not to blow my memory with this matrix.

I wrote this code to convert the txt matrix to an XDF file and then cluster it:

# Convert the text file to XDF, then build a formula over all variables
rxTextToXdf(inFile = inFile, outFile = outFile)
vars <- rxGetInfo(outFile, getVarInfo = TRUE)
myformula <- as.formula(paste("~", paste(names(vars$varInfo), collapse = "+"), sep = ""))

# Cluster into 5000 centers with Lloyd's algorithm and write the centers out as CSV
clust <- rxKmeans(formula = myformula, data = outFile, numClusters = 5000,
                  algorithm = "lloyd", overwrite = TRUE)
write.table(clust$centers, file = centersFiletxt, sep = ",",
            row.names = FALSE, col.names = FALSE)

But it has been running for a week now. Any ideas how to make it faster?

Upvotes: 3

Views: 3904

Answers (3)

vagabond

Reputation: 3594

If you can create a sample that represents your data, you could cluster the sample first, train a classification model on the labelled sample, and then predict on chunks of the remaining data to assign them to clusters (a rough sketch of this idea follows below).

Training the model will also tell you which variables are not significant, so you can reduce the dimensionality that way.

Why increase computational complexity with 5M rows × 512 features × 5000 clusters when you can get more insight by dealing with the problem piecemeal?
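
A minimal base-R sketch of this idea, with hypothetical file names and chunk sizes: plain nearest-centroid assignment stands in for the classifier, and kmeans() with a modest k stands in for the full 5000-center run.

set.seed(1)

# 1. Read a sample of rows (here simply the first 100K; a random sample would be better)
sample_dat <- as.matrix(read.table("data.txt", sep = ",", nrows = 1e5))

# 2. Cluster the sample
km <- kmeans(sample_dat, centers = 100, iter.max = 50)

# 3. Assign every row to its nearest sample centroid, reading the file in chunks
assign_chunk <- function(chunk, centers) {
  # squared Euclidean distance from each row to each center
  d2 <- outer(rowSums(chunk^2), rowSums(centers^2), "+") - 2 * chunk %*% t(centers)
  max.col(-d2)                              # index of the nearest center per row
}

con <- file("data.txt", open = "r")
repeat {
  chunk <- try(as.matrix(read.table(con, sep = ",", nrows = 1e5)), silent = TRUE)
  if (inherits(chunk, "try-error")) break   # no rows left
  write(assign_chunk(chunk, km$centers), file = "labels.txt", append = TRUE, ncolumns = 1)
}
close(con)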

Upvotes: 1

c-urchin

Reputation: 4534

If you bought RevoR, you also paid for support. Why not ask them?

Upvotes: 5

David Marx

Reputation: 8568

  1. Do you really need 5000 clusters? k-means performance scales with the number of clusters, so you're hurting yourself quite a bit with such a high k. If you can stand to reduce the number of clusters, that will help a lot.

  2. Are you sure you need all 512 dimensions? If you can trim out or combine some of those dimensions, that could also help. Have you tried running PCA on your data? Maybe you could try running k-means on just the top 10 components or something like that (a rough sketch follows after this list).

  3. Does it have to be k-means? You could try other algorithms like hierarchical clustering or self-organizing maps and see whether they perform faster. I'd recommend taking a sample of your data (maybe N=100K) and speed-testing a few clustering algorithms on it.

  4. Revolution R is definitely supposed to be way faster than base R, but it's still R. k-means is a very simple algorithm to implement: maybe try finding or coding an implementation a bit closer to the metal, in C/C++ or Fortran.

  5. Are you tracking your memory usage? Frankly, I suspect you have already blown your memory. In a single iteration, you're asking your computer to compute the distance from each of your 5 million points to each of your 5000 centroids in 512 dimensions. That works out to 5M × 5K × 512, or 1.28e13 values (multiply this by the bit width of your data type). You only have about 6.9e10 bits of RAM. Unless Revolution R is doing something very sneaky, there's simply no way to approach this problem on your hardware unless you buy way, way more RAM. Even with 64 GB, you're still several orders of magnitude short of a single k-means iteration (a quick back-of-envelope check follows after this list).

  6. You say you're using R in order not to blow your memory: maybe Revolution R is different, but conventional R does everything in memory, and as I described above, this problem isn't really going to be tractable on conventional hardware. You should consider renting time on a more powerful computing cluster like Amazon EC2.

  7. k-means is one of those algorithms that is "embarrassingly parallelizable." If you rent out server space, you could run this on a Hadoop cluster, and that should help a lot.

  8. What are you trying to accomplish here? 5000 clusters is a lot. What is the anticipated meaning of your 5000 clusters? I suspect that the real solution here isn't a faster k-means implementation or more powerful hardware, but rethinking your problem and what you're trying to accomplish.
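
Regarding point 2, here is a minimal base-R sketch of the PCA route. It assumes a manageable in-memory sample smp of the rows (running prcomp on the full 5M × 512 matrix would itself be heavy); the sample size and component counts are illustrative only.

set.seed(1)
# 'smp' is an assumed 100K x 512 numeric matrix sampled from the data
pca <- prcomp(smp)                          # centering is on by default
summary(pca)$importance[3, 1:20]            # cumulative variance of the first 20 PCs

reduced <- pca$x[, 1:10]                    # keep just the top 10 components
km <- kmeans(reduced, centers = 100, iter.max = 50)

# further rows could be projected into the same space before assignment:
# predict(pca, newdata = more_rows)[, 1:10]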
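
And regarding point 5, a quick back-of-envelope check of the arithmetic (the exact figures depend on the data type and on how much an implementation keeps in memory at once, but they illustrate the scale):

n <- 5e6           # points
k <- 5e3           # centroids
d <- 512           # dimensions

n * k * d          # 1.28e13 distance terms touched per iteration
n * k * 8 / 2^30   # ~186 GiB just to hold the n-by-k distance matrix as doubles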

Upvotes: 10
