Bernardo

Reputation: 129

Dimensionality reduction with SVD specifying energy to keep

I'd like to reduce the dimensionality of a few datasets with SVD. However, the current sklearn interface only allows me to specify the number of components to reduce to (through the n_components parameter).

This feels "hard-coded": some datasets have many more dimensions than others, and there is no correct number of components to determine a priori. Specifying the amount of energy to keep from the original matrix (or, more specifically, dataset) would be a better option. The quickest reference I could find is in this book, chapter 11 (specifically, page 20 of the PDF, in the "How Many Singular Values Should We Retain?" box).

Is there any way I can do that in scikit-learn, using SVD?

I have tried modifying the source code to allow this, but the current code performs an "optimization step" that depends on the number of components passed in. If I don't pass the number of components (i.e. leave it at its default), only 12 components are decomposed, and the energy calculation then uses only those 12 components. To base the calculation on the energy, I have to set n_components to the total number of features of each dataset (to be on the safe side), which is extremely slow for some larger datasets. A sketch of the criterion I'm after follows.
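For concreteness, this is a minimal sketch of the energy criterion using NumPy's full SVD (the function name rank_for_energy and the 90% threshold are just illustrative). It is of course the slow path I'd like to avoid, since it decomposes everything up front:

```python
import numpy as np

def rank_for_energy(X, energy=0.90):
    """Smallest k such that the top k singular values retain at
    least `energy` of the total squared singular-value mass."""
    s = np.linalg.svd(X, compute_uv=False)       # singular values, descending
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative energy fraction
    return int(np.searchsorted(cum, energy) + 1)
```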

Any ideas for solving this?

Upvotes: 1

Views: 1454

Answers (1)

Andreas Mueller

Reputation: 28768

As you can see from the documentation, you can pass the explained variance ratio (which I think is what you are looking for) as n_components, or use an "mle" estimate.
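For example (a minimal sketch; X stands in for your data matrix, and the 0.90 threshold is arbitrary):

```python
from sklearn.decomposition import PCA

# A float n_components keeps the smallest number of components whose
# cumulative explained variance ratio reaches at least that fraction.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

# Alternatively, let Minka's MLE estimate the dimensionality.
pca_mle = PCA(n_components='mle').fit(X)
```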

The PCA class will always compute the full SVD, though, so you won't get a speedup. You can use RandomizedPCA, but that does not allow selecting the number of components based on the explained variance ratio. You should try it anyway, because it will probably be a lot faster than PCA for large datasets, even if you compute all components (assuming n_features is not huge).
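If you go that route, one workaround is to compute the components once and truncate by cumulative explained variance ratio yourself afterwards (a sketch; in newer scikit-learn releases RandomizedPCA was folded into PCA(svd_solver='randomized'), and the names and 0.90 threshold below are illustrative):

```python
import numpy as np
from sklearn.decomposition import RandomizedPCA

# Compute (up to) all components once, then truncate afterwards.
rpca = RandomizedPCA(n_components=min(X.shape)).fit(X)

cum = np.cumsum(rpca.explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.90) + 1)   # smallest k reaching 90% variance
X_reduced = rpca.transform(X)[:, :k]
```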

Upvotes: 2
