5t3fth35rb

Reputation: 11

100% Accuracy using SVC classification, something must be wrong?

Context to what I'm trying to achieve:

I have a problem regarding image classification using scikit-learn (with scikit-image for feature extraction). I have CIFAR-10 data: 10000 training images and 1000 testing images. Each test/train set is stored in an npy file as a 4-d matrix (height, width, rgb, sample), and I also have the test/train labels. I have a 'computeFeatures' method that uses the Histogram of Oriented Gradients (HOG) method to represent the image-domain features as a vector. I iterate this method over both the training and testing data, using a for loop over i, and store the results in numpy arrays so that the features can be used later to classify the images. I must then continue to apply PCA/LDA and do image classification with SVC, CNN, etc. (any method of image classification).

import numpy as np
import skimage.feature
from sklearn.decomposition import PCA
from sklearn.svm import SVC

trnImages = np.load('trnImage.npy')
tstImages = np.load('tstImage.npy')
trnLabels = np.load('trnLabel.npy')
tstLabels = np.load('tstLabel.npy')

def computeFeatures(image):
    # HOG turns one (height, width, rgb) image into a fixed-length feature vector
    hog_feature, hog_as_image = skimage.feature.hog(image, visualize=True, block_norm='L2-Hys')
    return hog_feature

trnArray = np.zeros([10000, 324])  # one 324-d HOG vector per training image
tstArray = np.zeros([1000, 324])   # one per test image

for i in range(10000):
    trnFeatures = computeFeatures(trnImages[:, :, :, i])
    trnArray[i, :] = trnFeatures

for i in range(1000):
    tstFeatures = computeFeatures(tstImages[:, :, :, i])
    tstArray[i, :] = tstFeatures


pca = PCA(n_components=2)
trnModel = pca.fit_transform(trnArray)
pca = PCA(n_components=2)
tstModel = pca.fit_transform(tstArray)

# Divide the dataset into the two sets.
test_data = tstModel
test_labels = tstLabels 
train_data = trnModel
train_labels = trnLabels 

C = 1 
model = SVC(kernel='linear', C=C)

model.fit(train_data, train_labels.ravel())

y_pred = model.predict(test_data)

accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0] 
print('Percentage accuracy on testing set is: {0:.2f}%'.format(accuracy))

The accuracy prints out as 100%. I'm pretty sure this is wrong, but I'm not sure why?

Upvotes: 1

Views: 1706

Answers (3)

andrewchauzov

Reputation: 1009

First of all,

pca = PCA(n_components=2)
tstModel = pca.fit_transform(tstArray)

this is wrong. You have to use:

tstModel = pca.transform(tstArray)

Secondly, how did you choose the PCA dimensionality? Why 2? Why not 25 or 100? Two principal components may be too few for images. Also, as I understand it, the datasets are not scaled prior to PCA.
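Here is a minimal sketch of the corrected pipeline, reusing trnArray and tstArray from the question; the component count is only an illustrative guess:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# scale with statistics estimated on the training set only
scaler = StandardScaler()
trnScaled = scaler.fit_transform(trnArray)
tstScaled = scaler.transform(tstArray)

# likewise: fit PCA on the training data, then reuse it for the test data
pca = PCA(n_components=100)  # 100 is arbitrary; tune it
trnModel = pca.fit_transform(trnScaled)
tstModel = pca.transform(tstScaled)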

Just out of interest, check the balance of the classes.
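For example, a quick check with the label arrays from the question:

import numpy as np

# per-class counts; for CIFAR-10 these should be roughly equal
print(np.unique(trnLabels, return_counts=True))
print(np.unique(tstLabels, return_counts=True))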

Regarding 'shall we use PCA before SVM or not': it highly depends on the data. Try both cases and then decide. SVC may be pretty slow to train, so PCA (or another dimensionality reduction technique) may speed it up a little. But you need to check both cases.
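A rough comparison sketch, reusing the arrays from the question and the corrected PCA pipeline above (not benchmark-grade timing):

import time
from sklearn.svm import SVC

for name, (Xtr, Xte) in {'raw HOG': (trnArray, tstArray),
                         'after PCA': (trnModel, tstModel)}.items():
    start = time.time()
    clf = SVC(kernel='linear', C=1).fit(Xtr, trnLabels.ravel())
    acc = clf.score(Xte, tstLabels.ravel())
    print('{}: accuracy={:.3f}, time={:.1f}s'.format(name, acc, time.time() - start))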

Upvotes: 1

dedObed

Reputation: 1363

One obvious problem with your approach is that you apply PCA in a rather peculiar way. You should typically estimate only one transform -- on the training data -- and then use it to transform any evaluation set as well.

This way, you kind of... implement SVM with whitening batch-norm, which sounds cool but is at least rather unusual, so it would need much care. E.g., this way you cannot classify a single sample. Still, it may work as an unsupervised adaptation technique.

Apart from that, it's hard to tell without access to your data. Are you sure that the test and train sets are disjoint?
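For instance, a quick sketch to check for overlap, assuming the (height, width, rgb, sample) arrays from the question:

import numpy as np

# flatten each image to one row: (sample, height*width*rgb)
trn = trnImages.reshape(-1, trnImages.shape[-1]).T
tst = tstImages.reshape(-1, tstImages.shape[-1]).T

# compare raw bytes; any hit means a test image also occurs in the training set
trn_bytes = {img.tobytes() for img in trn}
overlap = sum(img.tobytes() in trn_bytes for img in tst)
print('{} of {} test images also appear in the training set'.format(overlap, len(tst)))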

Upvotes: 0

M__

Reputation: 636

The immediate concern in this sort of situation is that the model is over-fitted. Any professional reviewer would immediately return this to the investigator. In this case, I suspect it is a result of the statistical approach used.

I don't work with images, but I would question why PCA is being stacked onto SVM. In plain terms, you are using two successive methods that reduce/collapse the hyper-dimensional space, which would very likely lead to a foregone conclusion. If you collapse the high-level dimensionality once, why repeat it?

PCA is standard for images, but it should be followed by something very simple, such as k-means.
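If you go that route, a minimal sketch with scikit-learn, reusing the PCA-reduced trnModel from the question:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=10, random_state=0)  # 10 = number of CIFAR-10 classes
clusters = km.fit_predict(trnModel)         # cluster the reduced features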

The other approach, instead of PCA, is of course NMF, and I would recommend it if you feel PCA is not providing the resolution sought.
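A minimal NMF sketch, reusing trnArray and tstArray from the question (HOG features are non-negative, which NMF requires; the component count is an arbitrary choice):

from sklearn.decomposition import NMF

nmf = NMF(n_components=50, init='nndsvd', random_state=0)
trnModel = nmf.fit_transform(trnArray)  # factorize the training features only
tstModel = nmf.transform(tstArray)      # reuse the same basis for the test set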

Otherwise the calculation looks fine.


accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0] 

On second thoughts, the accuracy metric might not be a matter of over-fitting, IF (that's a grammatical-emphasis-type 'IF') test_labels contained a prediction of the image (of which ~50% are incorrect).

I'm just guessing that this is what the "test_labels" data is, however, and we have no idea how that prediction was derived, so I'm not sure there's enough information to answer the question. BTW, could someone explain "shape[0]", please? Is it needed?

Upvotes: 0
