Reputation: 113
I am building a sentiment analysis classifier with three classes: positive, neutral and negative. My training data has shape (14640, 15), with the following class counts:
negative 9178
neutral 3099
positive 2363
I pre-processed the tweets to standardize the text and then applied the bag-of-words vectorization technique so that the text can be fed to the model; the resulting feature matrix has shape (14640, 1000). Because the labels Y are strings, I applied LabelEncoder to turn them into a single array of integers, like this -
[1 2 1 ... 1 0 1]
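For readers wondering which integer stands for which sentiment, here is a minimal sketch of how LabelEncoder assigns codes. The sample labels are made up for illustration; the point is only that the classes are sorted, so with these three names the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the three sentiment labels from the question.
labels = ["neutral", "positive", "negative", "neutral", "negative"]

encoder = LabelEncoder()
Y_demo = encoder.fit_transform(labels)

# LabelEncoder sorts the class names, so the codes follow alphabetical order.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(Y_demo)   # [1 2 0 1 0]
```

Under that mapping, the label 2 that goes missing later in the question would be the positive class.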
This is how I split my dataset -
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
out:(10248, 1000) (10248,)
(4392, 1000) (4392,)
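As a rough illustration of what the stratify argument in the split above does, the sketch below rebuilds a label array with the class counts from the question and compares class shares before and after splitting. The placeholder feature matrix and the assignment of counts to the codes 0, 1, 2 are assumptions made only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported in the question
# (assumed coding: 0 = negative, 1 = neutral, 2 = positive).
y = np.array([0] * 9178 + [1] * 3099 + [2] * 2363)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

def class_shares(labels):
    """Return the fraction of samples in each of the three classes."""
    return np.bincount(labels, minlength=3) / len(labels)

print(class_shares(y))        # shares in the full data
print(class_shares(y_train))  # nearly identical shares in the training set
print(class_shares(y_test))   # nearly identical shares in the test set
```

The shares stay imbalanced in both splits; stratification only keeps them consistent, it does not even them out.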
Passing stratify=Y keeps the class proportions the same in the training and test sets; it does not re-balance the classes. For the classifier part, I have used SVM -
svc = svm.SVC(kernel='linear', C=1, probability=True, class_weight='balanced').fit(X_train, Y_train)
prediction = svc.predict_proba(X_test)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)
print(prediction_int)
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))
out:[0 0 0 ... 1 0 0]
Precision score: [0.74185137 0.50075529 0. ]
Accuracy Score: 0.6691712204007286
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
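The warning above is raised whenever some label never appears among the predictions. A small sketch with made-up labels reproduces the situation; it assumes a scikit-learn version that supports the zero_division argument of precision_score (0.22 or later), which only controls the value reported for such a label and does not fix the underlying predictions.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels: class 2 exists in the truth but is never predicted.
y_true = np.array([0, 1, 2, 2, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 1])

# zero_division=0 reports 0.0 for the unpredicted class without a warning.
per_class = precision_score(y_true, y_pred, average=None, zero_division=0)
print(per_class)  # third entry is 0.0 because class 2 is never predicted
```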
@desertnaut helped me a lot in narrowing down the actual problem. In the end I saw that the classifier never predicts the third class: prediction_int, printed above, contains no 2 at all, and it is nowhere near the actual labels. I am worried that I made a mistake in the classification step. I originally wrote this prediction code for binary classification and assumed I would not need to change it for multi-class classification. Can anyone help me solve this?
Upvotes: 2
Views: 1035
Reputation: 6260
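The problem is that the way you are using predict_proba only works for binary classification. In a multiclass problem it returns one probability per class, so prediction[:,1] holds only the probability of class 1, and thresholding it can only ever produce 0 or 1.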
You cannot use this command:
prediction_int = prediction[:,1] >= 0.3
For further information you can look at this similar post: Multiclass Classification and probability prediction
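To make the difference concrete, here is a small sketch with a made-up probability matrix (not output from the asker's model). It contrasts the thresholding rule from the question with picking the most probable class per row.

```python
import numpy as np

# Made-up probabilities for four samples; columns are classes 0, 1, 2.
proba = np.array([
    [0.70, 0.20, 0.10],
    [0.25, 0.35, 0.40],
    [0.10, 0.25, 0.65],
    [0.15, 0.60, 0.25],
])

# Binary-style rule from the question: can only ever produce 0 or 1.
thresholded = (proba[:, 1] >= 0.3).astype(int)

# Multiclass rule: choose the column with the highest probability.
most_likely = proba.argmax(axis=1)

print(thresholded)  # [0 1 0 1]
print(most_likely)  # [0 2 2 1]
```

With a fitted classifier the column order follows its classes_ attribute, so the argmax indices should be mapped through classes_ if the labels are not simply 0, 1, 2.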
Update
I got it working after replacing the whole prediction step with this single line -
pred = svc.predict(X_test)
As the answer points out, I was previously reusing my binary classification prediction code. Now predict can return all 3 labels, so my precision and recall are computed correctly.
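For anyone who wants to try the fix end to end, below is a minimal sketch on synthetic data. The make_classification dataset stands in for the bag-of-words matrix and is purely an assumption; the classifier settings mirror the question, minus probability=True, which predict does not need.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the bag-of-words features (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = SVC(kernel="linear", C=1, class_weight="balanced").fit(X_train, y_train)
pred = clf.predict(X_test)  # one of the labels 0, 1, 2 per sample

# average=None reports one score per class instead of a single number.
precision = precision_score(y_test, pred, average=None)
recall = recall_score(y_test, pred, average=None)
print("Precision per class:", precision)
print("Recall per class:   ", recall)
```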
Upvotes: 1