Reputation: 1651
I am trying to perform PCA on an image dataset of 100,000 images, each of size 224x224x3.
I was hoping to project the images into a space of dimension 1000 (or somewhere around that).
I am doing this on my laptop (16 GB RAM, i7, no GPU) and have already set svd_solver='randomized'.
However, fitting takes forever. Are the dataset and the image dimension just too large, or is there some trick I could be using?
Thanks!
Edit:
This is the code:
from sklearn.decomposition import PCA

pca = PCA(n_components=1000, svd_solver='randomized')
pca.fit(X)
Z = pca.transform(X)
X is a 100000 x 150528 matrix in which each row is a flattened image (224 * 224 * 3 = 150528).
Upvotes: 0
Views: 3690
Reputation: 1167
You should really reconsider your choice of dimensionality reduction if you think you need 1000 principal components. With that many components you lose interpretability, so you might as well use other, more flexible dimensionality reduction algorithms (e.g. variational autoencoders, t-SNE, kernel PCA). A key benefit of PCA is the interpretability of the principal components.
If you have a video stream of the same place, then you should be fine with fewer than 10 components (though principal component pursuit might be better). Conversely, if your image dataset does not consist of similar-ish images, then PCA is probably not the right choice.
Also, for images, nonnegative matrix factorisation (NMF) might be better suited. For NMF, you can perform stochastic gradient optimisation, subsampling both pixels and images for each gradient step.
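If you want to try that, here is a minimal sketch using scikit-learn's MiniBatchNMF (my own illustration, not part of the original suggestion; it needs scikit-learn 1.1+, subsamples images but not pixels per update, and the 1000 components and batch size of 256 are just placeholder values carried over from the question):
from sklearn.decomposition import MiniBatchNMF

# X must be nonnegative, e.g. raw pixel intensities in [0, 255] or [0, 1]
nmf = MiniBatchNMF(n_components=1000, batch_size=256, random_state=0)
W = nmf.fit_transform(X)   # 100000 x 1000 encoding of the images
H = nmf.components_        # 1000 x 150528 nonnegative "parts"
MiniBatchNMF also exposes partial_fit, so you can stream the images in chunks instead of keeping all of X in memory.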
However, if you still insist on performing PCA, then I think that the randomised solver provided by Facebook is the best shot you have. Run pip install fbpca
and run the following code
from fbpca import pca
# load data into X
U, s, Vh = pca(X, 1000)  # randomised rank-1000 PCA of X
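If you also want the 1000-dimensional representation of each image (the analogue of Z in the question), note that with the default settings fbpca centres the data before decomposing it, so the scores are U scaled by the singular values:
Z = U * s  # 100000 x 1000 projection of the (centred) images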
It's not possible to get faster than that without utilising some matrix structure, e.g. sparsity or block composition (which your dataset is unlikely to have).
Also, if you need help picking the correct number of principal components, I recommend using this code:
import numpy as np
import fbpca
from bisect import bisect_left

def compute_explained_variance(singular_values):
    return np.cumsum(singular_values**2) / np.sum(singular_values**2)

def ideal_number_components(X, wanted_explained_variance):
    singular_values = np.linalg.svd(X, compute_uv=False)  # Full SVD: this line is a bottleneck.
    explained_variance = compute_explained_variance(singular_values)
    # bisect_left returns an index; index i means i + 1 components are needed.
    return bisect_left(explained_variance, wanted_explained_variance) + 1

def auto_pca(X, wanted_explained_variance):
    num_components = ideal_number_components(X, wanted_explained_variance)
    return fbpca.pca(X, num_components)  # This line is a bottleneck if the number of components is high.
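For example, to keep roughly 95% of the variance (an arbitrary illustrative threshold, not one from the question):
U, s, Vh = auto_pca(X, 0.95)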
Of course, the above code doesn't support cross-validation, which you really should use to choose the correct number of components.
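There is no built-in way to do that with fbpca, but a rough sketch of my own (not part of the code above) is to fit on a subset of the images and measure reconstruction error on held-out images for each candidate number of components. Note that this error usually keeps shrinking as you add components, so look for the point of diminishing returns rather than a strict minimum:
import numpy as np
import fbpca

def held_out_reconstruction_error(X, k, test_fraction=0.2, seed=0):
    # Split the images into a training set and a held-out set.
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape[0]) < test_fraction
    X_train, X_test = X[~mask], X[mask]

    # Fit a rank-k PCA on the training images only.
    U, s, Va = fbpca.pca(X_train, k)
    mean = X_train.mean(axis=0)

    # Project the held-out images onto the components and reconstruct them.
    reconstruction = ((X_test - mean) @ Va.T) @ Va + mean
    return np.mean((X_test - reconstruction) ** 2)

# Inspect where the error stops improving appreciably.
candidates = [50, 100, 250, 500, 1000]
errors = [held_out_reconstruction_error(X, k) for k in candidates]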
Upvotes: 1
Reputation: 1547
You can try to set
svd_solver="svd_solver"
The training should be much faster. You could also try to use:
from sklearn.decomposition import FastICA
which is more scalable. A last-resort solution could be to turn your images black and white to reduce the dimension by a factor of 3; this might be a good step if your task is not color-sensitive (for instance, optical character recognition).
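A minimal sketch of that conversion, assuming each row of X is a flattened 224x224x3 RGB image (the weights are the standard ITU-R BT.601 luminance coefficients):
import numpy as np

# Reshape the flattened rows back into images and collapse the color channels.
X_img = X.reshape(-1, 224, 224, 3)
X_gray = X_img @ np.array([0.299, 0.587, 0.114])  # weighted sum over the RGB channels
X_gray = X_gray.reshape(-1, 224 * 224)            # 50176 features instead of 150528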
Upvotes: 0