Reputation: 165
I've got a document classification problem with only 2 classes, and after CountVectorizer my training matrix is (40,845 x 218,904) with unigrams. If I include trigrams, it grows to (40,845 x 3,931,789). Is there a way to perform PCA on such a dataset without hitting memory errors or sparse-matrix errors? I'm using Python's sklearn on a 6GB machine.
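For context, the CountVectorizer output is a scipy sparse matrix, so it only becomes a problem when something densifies it (as standard PCA does). A rough sketch of the scale, with placeholder documents standing in for the real corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a couple of placeholder documents",
        "standing in for the real training set"]
X = CountVectorizer(ngram_range=(1, 3)).fit_transform(docs)
print(type(X))  # a scipy.sparse matrix; memory scales with the nonzeros only

# Densifying a (40845, 3931789) float64 matrix would need roughly:
dense_tb = 40845 * 3931789 * 8 / 1e12
print(f"{dense_tb:.2f} TB")  # far beyond 6 GB, hence the memory errors
```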
Upvotes: 2
Views: 3711
Reputation: 3388
If you've got 6GB of RAM you're on a 64-bit machine, so the easiest solution is probably just to add more RAM.
Otherwise, this is a crosspost of my answer here: https://scicomp.stackexchange.com/questions/1681/what-is-the-fastest-way-to-calculate-the-largest-eigenvalue-of-a-general-matrix/7487#7487
There has been some good research on this recently. The new approaches use "randomized algorithms" which only require a few reads of your matrix to get good accuracy on the largest eigenvalues. This is in contrast to power iterations which require several matrix-vector multiplications to reach high accuracy.
You can read more about the new research here:
http://math.berkeley.edu/~strain/273.F10/martinsson.tygert.rokhlin.randomized.decomposition.pdf
http://arxiv.org/abs/0909.4061
This code will do it for you:
http://cims.nyu.edu/~tygert/software.html
https://bitbucket.org/rcompton/pca_hgdp/raw/be45a1d9a7077b60219f7017af0130c7f43d7b52/pca.m
http://code.google.com/p/redsvd/
https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html
If your language of choice isn't in there, you can roll your own randomized SVD pretty easily; it only requires matrix-vector multiplications followed by a call to an off-the-shelf SVD on a much smaller matrix.
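A minimal sketch of that recipe in NumPy (after the Halko/Martinsson/Tropp papers linked above): sample the range of the matrix with a random test matrix, orthonormalize, and run a standard SVD on the small projected matrix. Since only `@` is used, `A` can also be a scipy sparse matrix.

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, seed=0):
    """Sketch of a randomized truncated SVD returning the top k factors."""
    rng = np.random.default_rng(seed)
    # Random test matrix; A @ Omega samples the range of A's top singular vectors.
    Omega = rng.standard_normal((A.shape[1], k + n_oversamples))
    Y = A @ Omega
    # Orthonormal basis for that sampled range.
    Q, _ = np.linalg.qr(Y)
    # Project A into the small subspace and run an off-the-shelf SVD there.
    B = (A.T @ Q).T
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

# Demo on an exactly rank-5 matrix, where the sketch recovers the
# singular values essentially exactly.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 500))
U, s, Vt = randomized_svd(A, k=5)
```

For matrices with a slowly decaying spectrum you'd add a few power iterations (multiplying by `A @ A.T` before the QR step) to sharpen accuracy.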
Upvotes: 4
Reputation: 6544
If you don't need too many components (and you normally don't), you can compute the principal components iteratively. I've always found this to be sufficient in practice.
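One way to do this (a sketch using scikit-learn's IncrementalPCA, which fits the components batch by batch; the random sparse matrix is just a stand-in for the vectorizer output):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import IncrementalPCA

# Stand-in for a CountVectorizer output: large, very sparse counts.
X = sparse.random(5_000, 1_000, density=0.01, format="csr", random_state=0)

ipca = IncrementalPCA(n_components=50)
batch = 1_000  # each batch must contain at least n_components rows
for start in range(0, X.shape[0], batch):
    # Densify one batch at a time, so peak memory is ~batch * n_features.
    ipca.partial_fit(X[start:start + batch].toarray())

# Transform in batches too, then stack the reduced representation.
X_reduced = np.vstack([
    ipca.transform(X[s:s + batch].toarray())
    for s in range(0, X.shape[0], batch)
])
```

This keeps only one dense batch in memory at a time, at the cost of an approximate fit compared to a full-batch PCA.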
Upvotes: 0
Reputation: 1960
You could try sparse SVD instead, as implemented through TruncatedSVD in sklearn:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
Performing SVD on a document dataset is a common technique, usually referred to as Latent Semantic Analysis (LSA). SVD and PCA are also quite similar. If you would like to know more about the differences, this question has some good information:
https://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca
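A minimal usage sketch (toy documents in place of the real corpus): TruncatedSVD operates directly on the sparse matrix and never densifies it, which is what makes it workable at this scale.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
    "markets rallied after the report",
]
X = CountVectorizer().fit_transform(docs)   # scipy sparse count matrix

# Reduce to 2 latent components (LSA) without converting X to dense.
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)
```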
Upvotes: 5