user2179347

Reputation: 165

Performing PCA on a large dataset

I've got a document classification problem with only 2 classes, and my training dataset matrix, after the CountVectorizer step, has size (40,845 x 218,904) with unigrams. If I consider trigrams, it can reach (40,845 x 3,931,789). Is there a way to perform PCA on such a dataset without getting memory or sparse-matrix errors? I'm using Python's sklearn on a 6GB machine.

Upvotes: 2

Views: 3711

Answers (3)

dranxo

Reputation: 3388

If you've got 6GB of RAM, you've got a 64-bit machine, so the easiest solution is probably just to add more RAM.

Otherwise, what follows is cross-posted from this answer: https://scicomp.stackexchange.com/questions/1681/what-is-the-fastest-way-to-calculate-the-largest-eigenvalue-of-a-general-matrix/7487#7487

There has been some good research on this recently. The new approaches use "randomized algorithms", which require only a few passes over your matrix to get good accuracy on the largest eigenvalues. This is in contrast to power iteration, which requires many matrix-vector multiplications to reach high accuracy.

You can read more about the new research here:

http://math.berkeley.edu/~strain/273.F10/martinsson.tygert.rokhlin.randomized.decomposition.pdf

http://arxiv.org/abs/0909.4061

These implementations will do it for you:

http://cims.nyu.edu/~tygert/software.html

https://bitbucket.org/rcompton/pca_hgdp/raw/be45a1d9a7077b60219f7017af0130c7f43d7b52/pca.m

http://code.google.com/p/redsvd/

https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html

If your language of choice isn't covered there, you can roll your own randomized SVD pretty easily; it only requires matrix-vector multiplications followed by a call to an off-the-shelf SVD.
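For instance, here is a minimal sketch in Python (to match the question's environment). The function name and defaults are my own illustration of the sketch-and-factor recipe from the papers above, not code from any of the packages listed:

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, n_iter=2, seed=0):
    """Approximate rank-k SVD of A (dense or scipy.sparse) via random projection."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    p = k + n_oversamples
    # Sketch the range of A with a random Gaussian test matrix.
    Y = A @ rng.standard_normal((n, p))
    # A couple of power iterations sharpen accuracy when singular values decay slowly.
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(Y)              # re-orthonormalize for numerical stability
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                  # orthonormal basis for the sketched range
    # Project A onto that basis and run an off-the-shelf SVD on the small matrix.
    B = (A.T @ Q).T                         # shape (p, n); small enough for a dense SVD
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]
```

Usage would look like `U, s, Vt = randomized_svd(X, k=100)`, after which `U * s` (equivalently `X @ Vt.T`) gives the 100-dimensional projection. On the question's (40,845 x 218,904) matrix this costs only a handful of sparse matrix products plus one small dense SVD.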

Upvotes: 4

Raff.Edward

Reputation: 6544

If you don't need too many components (which you normally don't), you can compute the principal components iteratively. I've always found this to be sufficient in practice.
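To illustrate the idea (my own sketch, not code from this answer): power iteration finds the leading principal component using only sparse matrix-vector products, with the centering done implicitly so the sparse matrix is never densified:

```python
import numpy as np

def leading_component(X, n_iter=500, tol=1e-9, seed=0):
    """Leading principal component of sparse X via power iteration on the covariance."""
    n, d = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()     # feature means, dense vector of length d
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        # Covariance-vector product without forming X^T X or densifying X:
        #   C v = X^T (X v) / n  -  mu (mu . v)
        w = X.T @ (X @ v) / n - mu * (mu @ v)
        lam = np.linalg.norm(w)                 # converges to the top eigenvalue
        w /= lam
        # Compare up to sign, since eigenvectors are only defined up to +/-.
        if min(np.linalg.norm(w - v), np.linalg.norm(w + v)) < tol:
            v = w
            break
        v = w
    return v, lam
```

Further components can be obtained by deflation: before each iteration, subtract the projection of the current vector onto the components already found.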

Upvotes: 0

Newmu

Reputation: 1960

You could try sparse SVD instead, as implemented through TruncatedSVD in sklearn:

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
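A minimal sketch of that route; here `docs` is a placeholder for your training documents, and `n_components=200` is just a typical LSA choice, not a prescribed value:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [...]  # placeholder: your 40,845 training documents as strings

vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)          # stays scipy.sparse; never densified

# TruncatedSVD accepts sparse input directly, unlike sklearn's PCA.
svd = TruncatedSVD(n_components=200, random_state=0)
X_reduced = svd.fit_transform(X)            # dense (n_docs x 200) array
print(svd.explained_variance_ratio_.sum())  # variance captured by the components
```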

Performing SVD on a document dataset is a common technique, usually referred to as Latent Semantic Analysis (LSA). SVD and PCA are also closely related; if you would like to know more about the differences, this question has some good information:

https://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca

Upvotes: 5
