rescot

Reputation: 335

Multi-label text classification from a single-label dataset

I have a dataset with a single label per document like the example below.

  label           text

  pay            "i will pay now"
  finance        "are you the finance guy?"
  law            "lawyers and law"
  court          "was at the court today"
  finance report "bank reported annual share.."

The text document can be labelled with more than one label, so how can I do a multi-label classification on this dataset? I've read a lot of documentation from sklearn, but I can't seem to find the right way to do multi-label classification on a single-label dataset. Thanks in advance for any help.

So far, this is what I have:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn import preprocessing

loc = r'C:\Users\..\Downloads\excel.xlsx'

df = pd.read_excel(loc)
X = np.array(df.docs)
z = np.array(df.title)
y = np.array(df.raw)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform(y_train)
Y_test = mlb.fit_transform(y_test)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

doc_new = np.array(['X has announced that it will sell $587 million'])

print("Accuracy Score: ", accuracy_score(Y_test, predicted))
print(mlb.inverse_transform(classifier.predict(doc_new)))

But I keep getting a dimension error:

.format(len(self.classes_), yt.shape[1]))
ValueError: Expected indicator for 44 classes, but got 46

Upvotes: 0

Views: 700

Answers (1)

rescot

Reputation: 335

I found the solution. I used pandas GroupBy

df = pd.DataFrame(df.groupby(["id", "doc"]).label.apply(list)).reset_index()

to group text with more than one class together and it worked.
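As a minimal sketch of how that grouping feeds into MultiLabelBinarizer: the toy rows below are made up for illustration, and the column names id, doc and label are simply taken from the line above. MultiLabelBinarizer expects each sample to be an iterable of labels, so a bare string per row would be split into characters, whereas the grouped lists give one column per label.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data: one row per (document, label) pair, so a document
# with two labels appears twice.
df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "doc": [
        "i will pay now",
        "lawyers and law",
        "bank reported annual share",
        "bank reported annual share",
    ],
    "label": ["pay", "law", "finance", "report"],
})

# Collect all labels of the same document into a single list.
grouped = df.groupby(["id", "doc"])["label"].apply(list).reset_index()

# Each row now holds a list of labels, which is the input shape
# MultiLabelBinarizer expects.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(grouped["label"])

print(grouped)
print(mlb.classes_)
print(Y)
```

After this step, grouped["doc"] can be used as X and Y as the indicator matrix for OneVsRestClassifier.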

The dimension error has also been solved.
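For reference, one likely cause of the class-count mismatch in the question is that fit_transform was called a second time on the test labels, which refits the binarizer on whatever labels happen to be in the test split. The sketch below uses made-up label lists and assumes the usual pattern of fitting on the training labels only and then calling transform on the test labels; it is an illustration, not necessarily the exact fix I applied.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists for illustration.
y_train = [["pay"], ["finance", "report"], ["law"], ["court"]]
y_test = [["finance"], ["law"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)  # learns the classes from the training labels
Y_test = mlb.transform(y_test)        # reuses the same classes and column order

# Refitting on the test labels alone learns a different, smaller class set,
# so the indicator matrices no longer line up.
refit = MultiLabelBinarizer().fit_transform(y_test)

print(Y_train.shape, Y_test.shape, refit.shape)
```

With the binarizer fitted once, the output of classifier.predict has the same number of columns as mlb.classes_, so mlb.inverse_transform can map predictions back to label names.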

Upvotes: 0
