Outcast

Reputation: 5117

IncrementalPCA & partial_fit - number of components

I am working in Python with about 4000 images of watches (examples: watch_1, watch_2). The images are RGB with a resolution of 450x450. My aim is to find the most similar watches among them. To fit this large dataset into my 26GB of RAM, I am using IncrementalPCA and partial_fit from scikit-learn (see also: SO_Link_1, SO_Link_2). My source code is the following:

import cv2
import numpy as np
import os
from glob import glob
from sklearn.decomposition import IncrementalPCA
from sklearn import neighbors
from sklearn import preprocessing


data = []

# Read images from file #
for filename in glob('Watches/*.jpg'):

    img = cv2.imread(filename)
    # cv2.imread returns None if the file cannot be read
    if img is None:
        continue
    height, width = img.shape[:2]

    # Check that all my images are of the same resolution
    if height == 450 and width == 450:

        # Flatten each image into a single row vector
        data.append(img.reshape(-1))

# Normalise data #
data = np.array(data)
Norm = preprocessing.Normalizer()
Norm.fit(data)
data = Norm.transform(data)

# IncrementalPCA model #
ipca = IncrementalPCA(n_components=6)

length = len(data)
chunk_size = 4
pca_data = np.zeros(shape=(length, ipca.n_components))

for i in range(0, length // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
    pca_data[i * chunk_size: (i + 1) * chunk_size] = ipca.transform(data[i*chunk_size : (i+1)*chunk_size])

# K-Nearest neighbours #
# Search in the reduced PCA space rather than in the raw pixel space
knn = neighbors.NearestNeighbors(n_neighbors=4, algorithm='ball_tree', metric='minkowski').fit(pca_data)
distances, indices = knn.kneighbors(pca_data)
print(indices)

However, when I run this program on an initial set of 40 watch images, I get the following error when i = 1:

ValueError: Number of input features has changed from 4 to 6 between calls to partial_fit! Try setting n_components to a fixed value.

However, I explicitly set n_components to 6 with ipca = IncrementalPCA(n_components=6), yet for some reason ipca appears to take chunk_size = 4 as the number of components when i = 0 and then switches to 6 when i = 1.

Why is this happening?

How can I fix it?

Upvotes: 1

Views: 1186

Answers (1)

sascha

Reputation: 33532

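This seems to follow from the math behind PCA: the problem is ill-conditioned when n_components > n_samples. Here each call to partial_fit sees only chunk_size = 4 samples, and a batch of 4 samples can yield at most 4 components, which is fewer than the 6 requested.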
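As a rough illustration of that limit (a sketch on synthetic data, not scikit-learn's internals; the batch shape of 4 x 50 is an arbitrary assumption), the SVD of a 4-sample batch cannot return more than 4 directions:

```python
import numpy as np

# Synthetic stand-in for one chunk: 4 samples, 50 features (sizes are assumptions)
rng = np.random.default_rng(0)
batch = rng.random((4, 50))

# PCA works on mean-centered data
centered = batch - batch.mean(axis=0)

# The thin SVD returns min(n_samples, n_features) right singular vectors
u, s, vt = np.linalg.svd(centered, full_matrices=False)

print(vt.shape)  # (4, 50): only 4 candidate components, so 6 cannot be extracted
print(s)         # the last singular value is numerically zero after centering
```

Centering removes one more degree of freedom, so a batch of 4 samples effectively carries at most 3 informative directions.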

You might be interested in reading this (the change that introduced the error message) and some of the discussion behind it.

Try increasing the batch size (chunk_size) so that each chunk holds at least n_components samples, or lower n_components.
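A minimal sketch of that suggestion, assuming random data in place of the flattened images (the 40 x 300 array and chunk_size = 10 are placeholder choices). It also fits on all chunks first and transforms in a second pass, so every row is projected with the final model:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.neighbors import NearestNeighbors

# Placeholder data: 40 "images" with 300 features each (assumption, not the real watches)
rng = np.random.default_rng(0)
data = rng.random((40, 300))

n_components = 6
chunk_size = 10  # chosen so that chunk_size >= n_components

ipca = IncrementalPCA(n_components=n_components)

# Splitting into len(data) // chunk_size parts gives chunks of at least chunk_size rows
n_chunks = max(1, len(data) // chunk_size)
chunks = np.array_split(data, n_chunks)

# Pass 1: fit incrementally on every chunk
for chunk in chunks:
    ipca.partial_fit(chunk)

# Pass 2: project every chunk with the fully fitted model
pca_data = np.vstack([ipca.transform(chunk) for chunk in chunks])

# Nearest neighbours in the reduced space
knn = NearestNeighbors(n_neighbors=4).fit(pca_data)
distances, indices = knn.kneighbors(pca_data)
print(pca_data.shape, indices.shape)
```

With the real images, chunk_size is bounded by memory rather than by this toy size; the only hard constraint shown here is that no chunk passed to partial_fit is smaller than n_components.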

(In general, I'm also somewhat sceptical about this approach. I hope you tested it on a small example dataset using batch PCA. It does not seem that your watches are preprocessed with regard to geometry, e.g. cropping, or with histogram/color normalization.)
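One way such a sanity check could look, as a sketch on synthetic data (the array size, batch size and the cosine comparison are all assumptions made for illustration): fit batch PCA and IncrementalPCA on the same small array and compare the directions they find.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# Small synthetic dataset that fits in memory (assumption: 200 samples, 50 features)
rng = np.random.default_rng(0)
small = rng.random((200, 50))

n_components = 6

# Reference: ordinary batch PCA on the whole array
pca = PCA(n_components=n_components).fit(small)

# Candidate: IncrementalPCA fed in batches of 20 samples
ipca = IncrementalPCA(n_components=n_components, batch_size=20).fit(small)

# Absolute cosine between matching components (sign is arbitrary in PCA);
# values close to 1 mean both methods found similar directions
cosines = np.abs(np.sum(pca.components_ * ipca.components_, axis=1))

print(cosines)
print(pca.explained_variance_ratio_.sum(), ipca.explained_variance_ratio_.sum())
```

On unstructured random data the later components can differ noticeably, so this is more informative when run on a small subset of the real images.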

Upvotes: 2
