sayak

Reputation: 21

MultiLabelBinarizer gives individual characters instead of the classes

I am working on a multi-label text classification problem and am using MultiLabelBinarizer to encode the labels. The labels in the dataset look like this:

[cs.AI, cs.CL, cs.CV, cs.NE, stat.ML]
[cs.CL, cs.AI, cs.LG, cs.NE, stat.ML]
[cs.CL, cs.AI, cs.LG, cs.NE, stat.ML]
[stat.ML, cs.AI, cs.CL, cs.LG, cs.NE]
[cs.CL, cs.AI, cs.LG, cs.NE, stat.ML]

When I run

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit(labels)
print(mlb.classes_)

I get:

array([' ', ',', '.', 'A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'L', 'M',
       'N', 'O', 'P', 'R', 'S', 'T', 'V', 'Y', '[', ']', 'a', 'c', 'h',
       'm', 's', 't'], dtype=object)

I (partially) fixed this by calling mlb.fit([y_train]) instead, which gave me (first 10 classes shown):

array(['[cs.AI, cs.CC]', '[cs.AI, cs.CV]', '[cs.AI, cs.CY]',
       '[cs.AI, cs.DB]', '[cs.AI, cs.DS]', '[cs.AI, cs.GT]',
       '[cs.AI, cs.HC]', '[cs.AI, cs.IR]', '[cs.AI, cs.LG, stat.ML]',
       '[cs.AI, cs.LG]'], dtype=object)

Ideally, it should output the individual classes (there may be something wrong in my code). When I call mlb.fit_transform([y_train]), I get:

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Help would be very much appreciated.

Upvotes: 2

Views: 1849

Answers (3)

vagitus

Reputation: 496

This solved the problem:

First, you need to get rid of brackets/quotation marks:

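# regex=False treats "[" and "]" as literal characters rather than regex syntax
data['labels'] = data['labels'].str.replace("[", "", regex=False)
data['labels'] = data['labels'].str.replace("]", "", regex=False)
data['labels'] = data['labels'].str.replace("'", "", regex=False)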

After that, split each string on the commas so that every row becomes a list of labels:

from sklearn.preprocessing import MultiLabelBinarizer

one_hot = MultiLabelBinarizer()

y_classes = one_hot.fit_transform(data['labels'].str.split(', '))

Now you have the encoded labels. This solved my problem, which was very similar to yours.
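For reference, here is a minimal sketch of the same idea without pandas. The names raw_labels and label_lists and the two sample strings are made up for illustration, and it assumes each label set is stored as a single string with labels separated by ", ":

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data: each row's labels are stored as ONE string.
raw_labels = ["[cs.AI, cs.CL, cs.CV]", "['cs.CL', 'stat.ML']"]

# Drop the brackets and quotes, then split on ", " to get one list per row.
label_lists = [
    s.strip("[]").replace("'", "").split(", ")
    for s in raw_labels
]

one_hot = MultiLabelBinarizer()
y_classes = one_hot.fit_transform(label_lists)

print(one_hot.classes_)  # one entry per distinct label
print(y_classes)         # one row per sample, one column per label
```

If your separator is "," without a trailing space, the split string would need to change accordingly.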

Upvotes: 2

Toukenize

Reputation: 1420

It is not working because your labels are not in the expected format. MultiLabelBinarizer expects a list of lists: one inner list of labels per row of data. Your example has 5 rows with 5 labels each, so y_train should be a list of 5 label lists, each containing that row's 5 labels:

y_train = [['cs.AI', 'cs.CL', 'cs.CV', 'cs.NE', 'stat.ML'],
           ['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML'],
           ['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML'],
           ['stat.ML', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.NE'],
           ['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML']]

Then binarize the labels using scikit-learn:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_label_binarized = mlb.fit_transform(y_train)

You will then get your desired output, where y_label_binarized is:

array([[1, 1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1, 1],
       [1, 1, 0, 1, 1, 1],
       [1, 1, 0, 1, 1, 1],
       [1, 1, 0, 1, 1, 1]])

and mlb.classes_ is

array(['cs.AI', 'cs.CL', 'cs.CV', 'cs.LG', 'cs.NE', 'stat.ML'],
      dtype=object)
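To check how the columns line up with the classes, one option is to map the matrix back to labels with inverse_transform. This is only a small sketch using the first two rows of the example above; the variable name recovered is made up for illustration:

```python
from sklearn.preprocessing import MultiLabelBinarizer

y_train = [['cs.AI', 'cs.CL', 'cs.CV', 'cs.NE', 'stat.ML'],
           ['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML']]

mlb = MultiLabelBinarizer()
y_label_binarized = mlb.fit_transform(y_train)

# Column i of the matrix corresponds to mlb.classes_[i].
# inverse_transform turns each row back into a tuple of label names.
recovered = mlb.inverse_transform(y_label_binarized)

for row, labels in zip(y_label_binarized, recovered):
    print(row, labels)
```

The recovered labels come back in the order of mlb.classes_, not in the order they had in y_train.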

Upvotes: 4

Dan

Reputation: 10786

When you call mlb.fit([y_train]), you aren't passing your training data to MultiLabelBinarizer. You're passing a list with a single element, and that element is your entire training set. This explains why your classes look odd: each "class" is one of your samples:

array(['[cs.AI, cs.CC]', '[cs.AI, cs.CV]', '[cs.AI, cs.CY]',
       '[cs.AI, cs.DB]', '[cs.AI, cs.DS]', '[cs.AI, cs.GT]',
       '[cs.AI, cs.HC]', '[cs.AI, cs.IR]', '[cs.AI, cs.LG, stat.ML]',
       '[cs.AI, cs.LG]'], dtype=object)

It also explains why fit_transform returns a single row with every value set to 1: you passed it one sample, which contained '[cs.AI, cs.CC]', '[cs.AI, cs.CV]' and so on. It interpreted each of those as a class, and your whole data set as a single sample that has every class.

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Instead, pass y_train directly:

mlb.fit(y_train)
mlb.fit_transform(y_train)
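The three behaviours seen in the question can be reproduced side by side. This is a rough sketch with made-up sample strings, and it assumes (as the first output in the question suggests) that each element of y_train is a single string rather than a list. In that case passing y_train directly still yields characters, so the strings need to be split into lists first:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data: each sample's labels stored as one string.
samples_as_strings = ["[cs.AI, cs.CL]", "[cs.CV]"]

# Case 1: strings passed directly. Each string is iterated character
# by character, so every character becomes a "class".
chars = MultiLabelBinarizer().fit(samples_as_strings).classes_
print(chars)

# Case 2: wrapped in an extra list. The whole data set is treated as
# one sample, and each string becomes a "class".
wrapped = MultiLabelBinarizer()
wrapped_out = wrapped.fit_transform([samples_as_strings])
print(wrapped.classes_, wrapped_out)

# Case 3: each string parsed into a list of labels first.
parsed = [s.strip("[]").split(", ") for s in samples_as_strings]
proper = MultiLabelBinarizer()
proper_out = proper.fit_transform(parsed)
print(proper.classes_, proper_out)
```

Only the third case produces one column per actual label.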

Upvotes: 1
