Reputation: 113
I am making Sentiment Analysis Classification and I am doing it with Scikit-learn. This has 3 labels, positive, neutral and negative. The Shape of my training data is (14640, 15)
, where
negative 9178
neutral 3099
positive 2363
I have pre-processed the data and applied the bag-of-words
word vectorization technique to the text of twitter as there many other attributes too, whose size is then (14640, 1000)
.
As the Y, means the label is in the text form so, I applied LabelEncoder to it. This is how I split my dataset -
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, random_state=42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
out: (10248, 1000) (10248,)
(4392, 1000) (4392,)
And this is my classifier
svc = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train, Y_train)
prediction = svc.predict_proba(X_test)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(np.int)
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))
out:Precision score: [0.73980398 0.48169243 0. ]
Accuracy Score: 0.6675774134790529
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
Now I am not sure why the third one, in precision score is blank? I have applied average=None
, because to make a separate precision score for every class. Also, I am not sure about the prediction, if it is right or not, because I wrote it for binary classification? Can you please help me to debug it to make it better. Thanks in advance.
Upvotes: 1
Views: 667
Reputation: 60321
As the warning explains:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
it seems that one of your 3 classes is missing from your predictions prediction_int
(i.e. you never predict it); you can easily check if this is the case with
set(Y_test) - set(prediction_int)
which should be the empty set {}
if this is not the case.
If this is indeed the case, and the above operation gives {1}
or {2}
, the most probable reason is that your dataset is imbalanced (you have much more negative
samples), and you do not ask for a stratified split; modify your train_test_split
to
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)
and try again.
UPDATE (after comments):
As it turns out, you have a class imbalance problem (and not a coding issue) which prevents your classifier from successfully predicting your 3rd class (positive
). Class imbalance is a huge sub-topic in itself, and there are several remedies proposed. Although going into more detail is arguably beyond the scope of a single SO thread, the first thing you should try (on top of the suggestions above) is to use the class_weight='balanced'
argument in the definition of your classifier, i.e.:
svc = svm.SVC(kernel='linear', C=1, probability=True, class_weight='balanced').fit(X_train, Y_train)
For more options, have a look at the dedicated imbalanced-learn Python library (part of the scikit-learn-contrib projects).
Upvotes: 1