Reputation: 1258
I'm trying to do multilabel classification with SVM.
I have nearly 8k features and also have y vector of length with nearly 400. I already have binarized Y vectors, so I didn't use MultiLabelBinarizer()
but when I use it with my Y data's raw form, it still gives same thing.
I'm running this code:
X = np.genfromtxt('data_X', delimiter=";")
Y = np.genfromtxt('data_y', delimiter=";")
training_X = X[:2600,:]
training_y = Y[:2600,:]
test_sample = X[2600:2601,:]
test_result = Y[2600:2601,:]
classif = OneVsRestClassifier(SVC(kernel='rbf'))
classif.fit(training_X, training_y)
print(classif.predict(test_sample))
print(test_result)
After all fitting process when it comes to prediction part, it says Label not x is present in all training examples
(x is a few different numbers in range of my y vector length which is 400). After that it gives predicted y vector which is always zero vector with length of 400(y vector length).
I'm new at scikit-learn and also in machine learning. I couldn't figure out the problem here. What's the problem and what should I do to fix it?
Thanks.
Upvotes: 10
Views: 5625
Reputation: 3316
There are 2 problems here:
1) The missing label warning
2) You are getting all 0's for predictions
The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, then some of them must only occur very rarely, and on any split of the data, some classes may be missing from one side of the split. There may also be classes that simply don't occur in your data at all. You could try Y.sum(axis=0).all()
and if that is False, then some classes do not occur even in Y. This all sounds horrible, but realistically, you aren't going to be able to correctly predict classes that occur 0, 1, or any very small number of times anyway, so predicting 0 for those is probably about the best you can do.
As for the all-0 predictions, I'll point out that with 400 classes, probably all of your classes occur much less than half the time. You could check Y.mean(axis=0).max()
to get the highest label frequency. With 400 classes, it might only be a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably pick 0 for all classes on all instances. This isn't really an error, it is just because all of the class frequencies are low.
If you know that each instance has a positive label (at least one), you could get the decision values (clf.decision_function
) and pick the class with the highest one for each instance. You'll have to write some code to do that, though.
I once had a top-10 finish in a Kaggle contest that was similar to this. It was a multilabel problem with ~200 classes, none of which occurred with even a 10% frequency, and we needed 0-1 predictions. In that case I got the decision values and took the highest one, plus anything that was above a threshold. I chose the threshold that worked the best on a holdout set. The code for that entry is on Github: Kaggle Greek Media code. You might take a look at it.
If you made it this far, thanks for reading. Hope that helps.
Upvotes: 16