Henry Zhu

Reputation: 2628

Sklearn Multilabel ML: ValueError: Multioutput target data is not supported with label binarization

I am building a program that assigns multiple labels/tags to textual descriptions, using OneVsRestClassifier to do the labeling. After conversion, xTrain, xTest, and yTrain are all 'numpy.ndarray'. The error in the title seems strange, considering that I have split the training and test data in the correct manner. Below is my code:

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)

nb_clf = MultinomialNB()
sgd = SGDClassifier()
lr = LogisticRegression()
mn = MultinomialNB()

print("xTrain.shape = " + str(xTrain.shape))
print("xTest.shape = " + str(xTest.shape))
print("yTrain.shape = " + str(yTrain.shape))
print("yTest.shape = " + str(yTest.shape))

print("type(xTrain) = " + str(type(xTrain)))
print("type(xTest) = " + str(type(xTest)))

xTrain = csr_matrix(xTrain).toarray()
xTest = csr_matrix(xTest).toarray()
yTrain = csr_matrix(yTrain).toarray()

print("type(xTrain) = " + str(type(xTrain)))

for classifier in [nb_clf, sgd, lr, mn]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(xTrain.astype("U"), yTrain.astype("U"))
    y_pred = clf.predict(xTest)
    print("\ny_pred:")
    print(y_pred)

x output:

  (1466, 1292)  0.13531037414782607
  (1466, 1238)  0.21029405543816293
  (1466, 988)   0.04688335706505732
  ...
  ...

y output:

[[0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

print statements output:

xTrain.shape = (1173, 13817)
xTest.shape = (294, 13817)
yTrain.shape = (1173, 28)
yTest.shape = (294, 28)
type(xTrain) = <class 'scipy.sparse.csr.csr_matrix'>
type(xTest) = <class 'scipy.sparse.csr.csr_matrix'>
type(xTrain) = <class 'numpy.ndarray'>
type(xTest) = <class 'numpy.ndarray'>
type(yTrain) = <class 'numpy.ndarray'>

error (at the clf.fit line):

ValueError: Multioutput target data is not supported with label binarization

Upvotes: 0

Views: 2368

Answers (1)

monkeyking9528

Reputation: 126

Please first clarify the feature dimensions and the sample size in your program. For the target (y), the labels should not be one-hot encoded. For example, instead of [0 0 0 1], it should be [3].
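
A note on the encoding: class indices such as [3] fit a single-label problem. For a multi-label target like the (1173, 28) matrix in the question, OneVsRestClassifier does accept a 0/1 indicator matrix, as long as it stays numeric. A likely cause of the error is the astype("U") cast in the fit call: once the targets are strings, scikit-learn no longer treats them as a multi-label indicator matrix. The sketch below is a minimal illustration under that assumption. The toy arrays X and Y and the choice of LogisticRegression are made up for the example and are not the asker's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (hypothetical): 6 samples, 3 features, 2 possible labels.
X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.9, 0.1],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
# One row per sample, one 0/1 column per label (a sample may have several).
Y = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1],
              [1, 1],
              [0, 0]])

clf = OneVsRestClassifier(LogisticRegression())

# Casting the targets to strings, as yTrain.astype("U") does in the question,
# should make fit() reject them.
try:
    clf.fit(X, Y.astype("U"))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Leaving the targets as a numeric indicator matrix lets fit() proceed.
clf.fit(X, Y)
Y_pred = clf.predict(X)
print(Y_pred.shape)  # one row per sample, one column per label
```

If this matches your situation, dropping both astype("U") casts and calling clf.fit(xTrain, yTrain) directly would be the first thing to try.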

Upvotes: 1
