Reputation: 31
I have already tried everything I can think of to solve my multilabel text classification problem in Python, and I would really appreciate any help. I based my approach on this answer using MultiLabelBinarizer and on this web page.
I am trying to predict certain categories in a dataset written in Spanish with 7 different labels; my dataset is shown here. Each row contains a text message and its labels, and each message has either one or two labels, depending on the message.
from sklearn.model_selection import train_test_split

# Keep only the label columns in df2
df2 = df.copy()
df2.drop(["mensaje", "pregunta_parseada", "tags_totales"], axis=1, inplace=True)

# Divide into train and test
X_train, X_test, y_train, y_test = train_test_split(df['pregunta_parseada'],
                                                    df2,
                                                    test_size=0.15,
                                                    random_state=42)
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the parsed questions with TF-IDF
tfidf = TfidfVectorizer()
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
lr = LogisticRegression(solver='sag', n_jobs=1)
clf = OneVsRestClassifier(lr)
# fit model on train data
clf.fit(features_train, labels_train)
# make predictions for validation set
y_pred = clf.predict(features_test)
So far, so good, but when I try to validate the model it seems as if almost every row is classified as "None" (all zeros):
y_pred[2]
accuracy_score(y_test, y_pred)
Output
array([0, 0, 0, 0, 0, 0, 0])
0.2574626865671642
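A quick way to confirm that the classifier almost never predicts a positive label is to count the positive predictions per label column and compare them with the true counts. This is just a diagnostic sketch using the variables defined above:
import numpy as np

# Positive predictions per label; near-zero counts mean the model
# is predicting "no label" almost everywhere.
print(np.asarray(y_pred).sum(axis=0))

# True positive counts per label in the test set, for comparison
print(np.asarray(labels_test).sum(axis=0))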
I also tried with MultiLabelBinarizer and had the same problem. What am I doing wrong? Trying with MultiLabelBinarizer produced the following results:
z=[["Generico"],["Mantenimiento"],["Motor"],["Generico"],["Motor"],
["Generico"],["Motor"],["Generico","Configuracion"],["Generico"],
["Motor"],["Consumo"],...,["Consumo"]]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y=mlb.fit_transform(z)
message = df["pregunta_parseada"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(message,
                                                    y,
                                                    test_size=0.15,
                                                    random_state=42)
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
accuracy_score(y_test, predicted)
#predicted[150]
all_labels = mlb.inverse_transform(predicted)
all_labels
With the following output:
(),
(),
(),
(),
('Generico',),
(),
(),
(),
(),
('Compra',),
('Motor', 'extras'),
Thank you so much for your help.
Upvotes: 3
Views: 1360
Reputation: 11198
I think the problem is with your data; it could be too sparse.
I see you're using OneVsRestClassifier, which builds one binary classifier per tag to decide whether that tag applies.
I don't think there is a straightforward bug in your code; the choice of model is just not right for the task.
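To see the one-classifier-per-tag structure concretely, here is a minimal sketch using the fitted clf from your first snippet (estimators_ is the scikit-learn attribute holding the per-label binary classifiers; labels_train is assumed to be the label DataFrame from above):
# After clf.fit(...), OneVsRestClassifier holds one fitted binary
# classifier per label column of y_train.
print(len(clf.estimators_))           # == number of labels (7 here)
for label, est in zip(labels_train.columns, clf.estimators_):
    print(label, est)                 # each est is a LogisticRegression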
The problem with these binary classifiers is data imbalance: even if you have exactly the same number of samples (n) per class, with c classes in total, each binary classifier sees n positive samples versus (c - 1) × n negative ones.
So there is obviously far more data in the negative class than in the positive class for every classifier. Each one is biased towards the negative class, and as a result each binary classifier tends to predict "not this label" (all zeros in the one-vs-all scenario) for most cases.
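As a concrete illustration with made-up numbers (7 labels as in your dataset; the 100 samples per label is hypothetical):
n, c = 100, 7                  # hypothetical: 100 samples per label, 7 labels
positives = n                  # samples where the label applies
negatives = (c - 1) * n        # samples where it does not
print(positives, negatives)    # 100 vs 600: a 1:6 imbalance per classifier

# On real data, the per-label split can be read off the binarized matrix y:
pos = y.sum(axis=0)
neg = y.shape[0] - pos
print(pos, neg)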
If you don't want to change your setup, one thing you can do is: instead of predict, use predict_proba to get the probability per class, and set a lower threshold (< 0.5) to decide which set of classes to choose. Your test accuracy is pretty low, so maybe re-adjust the threshold to get better accuracy.
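A minimal sketch of that thresholding, assuming the LogisticRegression-based clf and features from the first snippet (LinearSVC has no predict_proba, so this applies to the logistic-regression setup). The 0.3 threshold is an arbitrary starting point you would tune on validation data:
import numpy as np

threshold = 0.3                                # arbitrary; tune on validation data
probas = clf.predict_proba(features_test)      # shape: (n_samples, n_labels)
y_pred_thresh = (probas >= threshold).astype(int)

# Subset accuracy with the lowered threshold, for comparison
print(accuracy_score(labels_test, y_pred_thresh))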
Upvotes: 1