Isabel Kofoed Alonso

Reputation: 31

Multilabel text classification with Sklearn

I have already tried everything I can think of to solve my multilabel text classification problem in Python, and I would really appreciate any help. I based my approach on this answer using MultiLabelBinarizer and on this web page.

I am trying to predict certain categories in a dataset written in Spanish with 7 different labels; the dataset is shown here. Each row has a text message and its labels, and each message has either one or two labels, depending on the message.

df2 = df.copy()
df2.drop(["mensaje", "pregunta_parseada", "tags_totales"], axis=1, inplace=True)

# Divide into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['pregunta_parseada'],
                                                    df2,
                                                    test_size=0.15,
                                                    random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf = TfidfVectorizer()  # the vectorizer was not defined in the original snippet
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test


from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier


lr = LogisticRegression(solver='sag', n_jobs=1)
clf = OneVsRestClassifier(lr)

# fit model on train data
clf.fit(features_train, labels_train)

# make predictions for validation set
y_pred = clf.predict(features_test)

So far, so good, but when I try to validate the model it seems as if almost every sample is classified as "None":

y_pred[2]
accuracy_score(y_test,y_pred)

Output

array([0, 0, 0, 0, 0, 0, 0])
0.2574626865671642

I also tried with MultiLabelBinarizer and had the same problem. What am I doing wrong? Trying with MultiLabelBinarizer gave the following results:

z=[["Generico"],["Mantenimiento"],["Motor"],["Generico"],["Motor"], 
["Generico"],["Motor"],["Generico","Configuracion"],["Generico"], 
["Motor"],["Consumo"],...,["Consumo"]]

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y=mlb.fit_transform(z)
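For reference, MultiLabelBinarizer turns a list of label lists into a binary indicator matrix, with one column per label in alphabetical order. A minimal sketch using a few of the labels above:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# three toy rows: single-label, single-label, two-label
z = [["Generico"], ["Motor"], ["Generico", "Configuracion"]]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(z)

print(mlb.classes_)  # columns are sorted alphabetically
# ['Configuracion' 'Generico' 'Motor']
print(y)
# [[0 1 0]
#  [0 0 1]
#  [1 1 0]]
```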

message = df["pregunta_parseada"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(message, 
                                                y, 
                                                test_size=0.15, 
                                                random_state=42)
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
accuracy_score(y_test, predicted)
#predicted[150]
all_labels = mlb.inverse_transform(predicted)
all_labels

With the following output

 (),
 (),
 (),
 (),
 ('Generico',),
 (),
 (),
 (),
 (),
 ('Compra',),
 ('Motor', 'extras'),

Thank you so much for your help

Upvotes: 3

Views: 1360

Answers (1)

Zabir Al Nazi Nabil

Reputation: 11198

I think the problem is with your data; it may be too sparse.

I see you're using OneVsRestClassifier, which builds one binary classifier per label to decide the tags.

I don't think there's a straightforward bug in your code; the choice of model is just not right for the task.

The problem with these binary classifiers is data imbalance: even if you have exactly the same number of samples (n) for each of the c classes, each binary classifier sees the data as n positive samples versus n × (c - 1) negative samples.

So there is obviously far more data in the negative class than in the positive class for every classifier. The classifiers are biased toward the negative class, and as a result each binary classifier tends to predict the negative class (the "rest" in one-vs-rest) in most cases.
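As a concrete instance of that arithmetic (the numbers here are hypothetical, just to illustrate the ratio):

```python
# with c classes and n single-label samples per class, each one-vs-rest
# binary classifier sees n positives against n * (c - 1) negatives
c, n = 7, 100
positives = n              # samples of the target class
negatives = n * (c - 1)    # samples of all the other classes combined
print(positives, negatives)  # 100 vs 600 per binary classifier
```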

If you don't want to change your setup, then one thing you can do is:

  1. Instead of predict, use predict_proba to get a probability per class, and pick a lower decision threshold (< 0.5) to decide which set of classes to choose.

Your test accuracy is pretty low; you may need to re-adjust the threshold to get better accuracy.

  2. If possible, use a deep-learning-based approach such as BERT, which will give much better performance.
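The thresholding idea in option 1 can be sketched as follows. This is a minimal sketch on synthetic multilabel data standing in for your TF-IDF features; the threshold value 0.3 is just an example to tune on validation data:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# synthetic stand-in for the TF-IDF features and 7-label indicator matrix
X, y = make_multilabel_classification(
    n_samples=300, n_classes=7, allow_unlabeled=False, random_state=42)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# per-class probabilities, shape (n_samples, n_classes)
proba = clf.predict_proba(X)

# lower the decision threshold below the default 0.5
threshold = 0.3
y_pred = (proba >= threshold).astype(int)
```

With a lower threshold each binary classifier fires more often, so fewer samples end up with the all-zero ("None") prediction; sweep the threshold on a validation set rather than hard-coding it.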

Upvotes: 1
