Reputation: 515
I may not be able to find the help I need here, but I am hoping the smart coders of the internet can help me. I am attempting to use scikit-learn's SGDClassifier to classify physical events. These events produce a black-and-white image of a track, and I am trying to get a classifier to classify those images. The images are approximately 500 * 400 pixels (I am not quite sure), which for machine-learning purposes gives me a 200640-dimensional vector per image. I have 20,000 training events serialized in data packages of 200 events each, and then a further 2,000 events. Here is how I go about extracting and training:
>>> from sklearn.linear_model import SGDClassifier
>>> import dill
>>> import glob
>>> import numpy as np
>>> clf = SGDClassifier(loss='hinge')
>>> for file in glob.glob('./SerializedData/Batch1/*.pkl'):
...     with open(file, 'rb') as stream:
...         minibatch = dill.load(stream)
...     clf.partial_fit(minibatch.data, minibatch.target, classes=np.array([1, 2]))
(Some output stuff about the classifier)
>>>
This is the training part of my code, or at least a rough abbreviation of it. (My actual initialization of the classifier is a little more complicated.) For more info: minibatch.data gives a 2-dimensional numpy array of shape (200, 200640), i.e. 200 events, each described by a 200640-element array of pixel values. minibatch.target gives a numpy array holding the class label of each event.
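In other words, the layout of one minibatch is roughly this (a synthetic stand-in with zero-filled data, not my real events):

```python
import numpy as np

# Synthetic stand-in for one pickled minibatch: 200 events,
# each flattened to a 200640-element vector of pixel values.
data = np.zeros((200, 200640), dtype=np.uint8)
# One class label (1 or 2) per event.
target = np.tile([1, 2], 100)

print(data.shape)    # (200, 200640)
print(target.shape)  # (200,)
```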
However, after training on all 20,000 events, the classifier does not seem to have learned anything at all:
>>> file = open('./SerializedData/Batch2/train1.pkl', 'rb')
>>> test = dill.load(file)
>>> clf.predict(test.data)
array([ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2])
>>> clf.score(test.data, test.target)
0.485
As you can see, the classifier predicts class 2 for every test event. The only problem I can think of at the moment is that I do not have enough training events, but I find that hard to believe. Does anybody have any suggestions/solutions/answers to this predicament?
Upvotes: 1
Views: 794
Reputation: 6534
Unless your images are exceptionally simple, you are unlikely to get good results from scikit-learn alone when your inputs are raw images. You need to transform the images in some way to obtain genuinely useful features; raw pixel values make terrible features. You could try using some of the tools in scikit-image to create better features, or you could use a pre-trained convolutional neural network (CNN) to extract features for you. If you were feeling more adventurous, you could try to train a CNN to do the classification for your problem in particular.
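As a sketch of the scikit-image route, you could replace the raw pixel vectors with HOG (histogram of oriented gradients) descriptors before feeding them to SGDClassifier. The parameters below are just a starting point, not tuned for your data, and the random images merely stand in for your real track images:

```python
import numpy as np
from skimage.feature import hog
from sklearn.linear_model import SGDClassifier

def extract_features(images):
    """Turn raw grayscale images into HOG descriptors.

    Each 500x400 image collapses from ~200000 raw pixels to a few
    tens of thousands of gradient-orientation features, which carry
    far more shape information for a linear classifier.
    """
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(16, 16),
            cells_per_block=(2, 2))
        for img in images
    ])

# Toy demo: random images standing in for real track images.
rng = np.random.default_rng(0)
images = rng.random((10, 500, 400))
features = extract_features(images)

clf = SGDClassifier(loss='hinge')
clf.partial_fit(features, np.tile([1, 2], 5), classes=np.array([1, 2]))
print(features.shape)  # far fewer columns than the 200640 raw pixels
```

The same extract_features call would go inside your minibatch loop, so the out-of-core partial_fit training loop stays exactly as it is.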
Upvotes: 1