Reputation: 5117
I am working in Python with about 4000 images of watches (examples: watch_1, watch_2). The images are RGB with a resolution of 450x450. My aim is to find the most similar watches among them. For this reason I am using IncrementalPCA and partial_fit from scikit-learn to handle this large dataset with my 26 GB of RAM (see also: SO_Link_1, SO_Link_2). My source code is the following:
import cv2
import numpy as np
import os
from glob import glob
from sklearn.decomposition import IncrementalPCA
from sklearn import neighbors
from sklearn import preprocessing

data = []

# Read images from file #
for filename in glob('Watches/*.jpg'):
    img = cv2.imread(filename)
    height, width = img.shape[:2]
    img = np.array(img)
    # Check that all my images are of the same resolution
    if height == 450 and width == 450:
        # Reshape each image so that it is stored in one line
        img = np.concatenate(img, axis=0)
        img = np.concatenate(img, axis=0)
        data.append(img)

# Normalise data #
data = np.array(data)
Norm = preprocessing.Normalizer()
Norm.fit(data)
data = Norm.transform(data)

# IncrementalPCA model #
ipca = IncrementalPCA(n_components=6)
length = len(data)
chunk_size = 4
pca_data = np.zeros(shape=(length, ipca.n_components))
for i in range(0, length // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
    pca_data[i * chunk_size: (i + 1) * chunk_size] = ipca.transform(data[i*chunk_size : (i+1)*chunk_size])

# K-Nearest neighbours #
knn = neighbors.NearestNeighbors(n_neighbors=4, algorithm='ball_tree', metric='minkowski').fit(data)
distances, indices = knn.kneighbors(data)
print(indices)
However, when I run this program on just 40 images of watches to start with, I get the following error when i = 1:
ValueError: Number of input features has changed from 4 to 6 between calls to partial_fit! Try setting n_components to a fixed value.
However, it is obvious that I set n_components to 6 when coding ipca = IncrementalPCA(n_components=6), but for some reason ipca considers chunk_size = 4 as the number of components when i = 0 and then changes it to 6 when i = 1.
Why is this happening?
How can I fix it?
Upvotes: 1
Views: 1186
Reputation: 33532
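This seems to follow from the math behind PCA, which is ill-conditioned for n_components > n_samples: a batch of 4 samples spans at most 4 directions, so the first partial_fit call cannot produce 6 components.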
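The rank limit can be checked directly with NumPy. The following is only an illustrative sketch on random data (the array sizes are assumptions, not the actual watch images): a batch with 4 rows yields at most 4 singular vectors, and centering the batch removes one more degree of freedom.

```python
import numpy as np

# Hypothetical batch: 4 samples with 10 features each (stand-in for one chunk).
rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 10))

# PCA works on the centered data; its components are the right singular vectors.
centered = batch - batch.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

# Only min(n_samples, n_features) = 4 directions are available at all,
# so asking for 6 components from this batch cannot be satisfied.
print(Vt.shape)
print(np.linalg.matrix_rank(centered))
```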
You might be interested in reading this (introduction of error-message) and some discussion behind it.
Try increasing the batch size / chunk size (or lowering n_components).
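As a minimal sketch of that suggestion, assuming random data in place of the flattened images and an arbitrarily chosen chunk_size of 10, every batch passed to partial_fit holds at least n_components samples. The sketch also fits on all batches before transforming anything, which is a deviation from the loop in the question.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-in for the flattened, normalised images (assumption:
# 40 samples with 50 features each instead of 450*450*3 pixel values).
rng = np.random.default_rng(0)
data = rng.normal(size=(40, 50))

n_components = 6
chunk_size = 10  # chosen so that every batch has >= n_components samples

ipca = IncrementalPCA(n_components=n_components)

# First pass: fit on every batch before transforming anything.
for start in range(0, len(data), chunk_size):
    batch = data[start:start + chunk_size]
    if len(batch) >= n_components:  # skip a too-small trailing batch
        ipca.partial_fit(batch)

# Second pass: project the data with the final model, batch by batch.
pca_data = np.vstack([
    ipca.transform(data[start:start + chunk_size])
    for start in range(0, len(data), chunk_size)
])
print(pca_data.shape)  # (40, 6)
```

Transforming only after the last partial_fit means all rows are projected onto the same final components. In the question's loop each chunk is transformed with a partially fitted model, so early and late rows would not be directly comparable. The nearest-neighbour search would then presumably run on pca_data rather than on data.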
(In general, I'm also somewhat sceptical about this approach. I hope you tested it on a small example dataset using batch PCA. It does not seem that your watches are preprocessed with regard to geometry, e.g. cropping, or perhaps histogram/color normalization.)
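One way to run such a sanity check is to compare IncrementalPCA against ordinary batch PCA on a small dataset that fits in memory. The sketch below uses hypothetical toy data (200 samples driven by 3 latent factors plus slight noise), not watch images, and compares the explained variance ratios and the overlap of the two component subspaces.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# Hypothetical toy data: 200 samples, 20 features, generated from 3 latent
# factors of different strength plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

pca = PCA(n_components=3).fit(X)
ipca = IncrementalPCA(n_components=3, batch_size=50).fit(X)

# Explained variance ratios should be close if the incremental fit is sound.
print(pca.explained_variance_ratio_)
print(ipca.explained_variance_ratio_)

# Cosines of the principal angles between the two component subspaces;
# values near 1 mean both methods found essentially the same subspace.
overlap = np.linalg.svd(pca.components_ @ ipca.components_.T, compute_uv=False)
print(overlap)
```

If the two fits disagree noticeably on a small sample of real images, that would suggest revisiting the batch size or the preprocessing before scaling up to all 4000 images.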
Upvotes: 2