Reputation: 515
I may not be able to find the help I need here, but I am hoping the smart coders of the internet can help me. I am attempting to use scikit-learn's SGDClassifier to classify physical events. These events produce a black-and-white image of a track, and I am trying to get a classifier to classify those images. The images are approximately 500 * 400 pixels (I am not quite sure), which for machine-learning purposes gives me a 200640-dimensional vector per image. I have 20,000 training events serialized in data packages of 200 events each, and then a further 2,000 events. Here is how I go about extracting and training:
>>> from sklearn.linear_model import SGDClassifier
>>> import dill
>>> import glob
>>> import numpy as np
>>> clf = SGDClassifier(loss='hinge')
>>> for file in glob.glob('./SerializedData/Batch1/*.pkl'):
...     with open(file, 'rb') as stream:
...         minibatch = dill.load(stream)
...     clf.partial_fit(minibatch.data, minibatch.target, classes=np.array([1, 2]))
(Some output stuff about the classifier)
>>>
This is the training part of my code, or at least a rough abbreviation of it. (My actual initialization of the classifier is a little more complicated.) For more info: minibatch.data gives a 2-dimensional numpy array of shape (200, 200640), i.e. 200 events, each described by a 200640-element array of pixel values. minibatch.target gives a numpy array holding the class label of each event.
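In other words, the layout of one minibatch is roughly this (a synthetic stand-in with zero-filled data, not my real events):

```python
import numpy as np

# Synthetic stand-in for one pickled minibatch: 200 events,
# each flattened to a 200640-element vector of pixel values.
data = np.zeros((200, 200640), dtype=np.uint8)
# One class label (1 or 2) per event.
target = np.tile([1, 2], 100)

print(data.shape)    # (200, 200640)
print(target.shape)  # (200,)
```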
However, after training on all 20,000 events, the classifier does not seem to have learned anything at all:
>>> file = open('./SerializedData/Batch2/train1.pkl', 'rb')
>>> test = dill.load(file)
>>> clf.predict(test.data)
array([ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2])
>>> clf.score(test.data, test.target)
0.485
As you can see, the classifier predicts class 2 for every test event. The only problem I can think of at the moment is that I do not have enough training events, but I find that hard to believe. Does anybody have any suggestions/solutions/answers to this predicament?
Upvotes: 1
Views: 794
Reputation: 6534
Unless your images are exceptionally simple, you are unlikely to get good results from scikit-learn alone when your inputs are raw images. You need to transform the images in some way to obtain genuinely useful features; raw pixel values make terrible features. You could try using some of the tools in scikit-image to create better features, or you could use a pre-trained convolutional neural network (CNN) to extract features for you. If you were feeling more adventurous, you could try to train a CNN to do the classification for your problem in particular.
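As a sketch of the scikit-image route, you could replace the raw pixel vectors with HOG (histogram of oriented gradients) descriptors before feeding them to SGDClassifier. The parameters below are just a starting point, not tuned for your data, and the random images merely stand in for your real track images:

```python
import numpy as np
from skimage.feature import hog
from sklearn.linear_model import SGDClassifier

def extract_features(images):
    """Turn raw grayscale images into HOG descriptors.

    Each 500x400 image collapses from ~200000 raw pixels to a few
    tens of thousands of gradient-orientation features, which carry
    far more shape information for a linear classifier.
    """
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(16, 16),
            cells_per_block=(2, 2))
        for img in images
    ])

# Toy demo: random images standing in for real track images.
rng = np.random.default_rng(0)
images = rng.random((10, 500, 400))
features = extract_features(images)

clf = SGDClassifier(loss='hinge')
clf.partial_fit(features, np.tile([1, 2], 5), classes=np.array([1, 2]))
print(features.shape)  # far fewer columns than the 200640 raw pixels
```

The same extract_features call would go inside your minibatch loop, so the out-of-core partial_fit training loop stays exactly as it is.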
Upvotes: 1