Scott Stewart
Scott Stewart

Reputation: 203

Using MultilabelBinarizer on test data with labels not in the training set

Given this simple example of multilabel classification (taken from this question, use scikit-learn to classify into multiple categories)

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    ["new york"],
            ["new york"],["london"],["london"],["london"],["london"],
            ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["london"],["london"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]


lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)


print "Accuracy Score: ",accuracy_score(Y_test, predicted)

The code runs fine, and prints the accuracy score, however if I change y_test_text to

y_test_text = [["new york"],["london"],["england"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]

I get

Traceback (most recent call last):
  File "/Users/scottstewart/Documents/scikittest/example.py", line 52, in <module>
     print "Accuracy Score: ",accuracy_score(Y_test, predicted)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 181, in accuracy_score
differing_labels = count_nonzero(y_true - y_pred, axis=1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 393, in __sub__
raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes

Notice the introduction of the 'england' label which is not in the training set. How do I use multilabel classification so that if a "test" label is introduced, i can still run some some of metrics? Or is that even possible?

EDIT: Thanks for answers guys, I guess my question is more about how the scikit binarizer works or should work. Given my short sample code, i would also expect if i changed y_test_text to

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]

That it would work--i mean we have fitted for that label, but in this case I get

ValueError: Can't handle mix of binary and multilabel-indicator

Upvotes: 18

Views: 21639

Answers (3)

javigzz
javigzz

Reputation: 942

As mentioned in another comment, personally I would expect the binarizer to ignore the not seen classes at "transform" time. The classifier which is consuming the outcome of the binarizer might not react well if the features presented by the testing samples are different than what was used on training.

I addressed the issue just removing the non seen classes from the sample. I think is a safer approach than dynamically changing the fitted binarizer or (another option) extending it to allow for ignoring.

list(map(lambda names: np.intersect1d(lb.classes_, names), y_test_text))

didn't run with you actual code

Upvotes: 0

Geeocode
Geeocode

Reputation: 5797

You can, if you "introduce" the new label in the training y set too, like this:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    
                ["new york"],["new york"],["london"],["london"],         
                ["london"],["london"],["london"],["london"],
                ["new york","England"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]


lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

print Y_test

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
print predicted

print "Accuracy Score: ",accuracy_score(Y_test, predicted)

Output:

Accuracy Score:  0.571428571429

The key section is:

y_train_text = [["new york"],["new york"],["new york"],
                ["new york"],["new york"],["new york"],
                ["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","England"],
                ["new york","london"]]

Where we inserted "England" too. It makes sense, because other way how can predict the classifier some label if he didn't see it before? So we created a three label classification problem this way.

EDITED:

lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))

You have to pass the classes as arg to MultiLabelBinarizer() and it will work with any y_test_text.

Upvotes: 16

lejlot
lejlot

Reputation: 66775

In short - it is ill-posed problem. Classification assumes that all labels are known in advance, and so does binarizer . Fit it on all labels, and then train on any subset you want.

Upvotes: 4

Related Questions