Reputation: 21
I am working on a multi-label text classification problem. To encode the labels, I am using MultiLabelBinarizer. The labels of the dataset look like this:
[cs.AI, cs.CL, cs.CV, cs.NE, stat.ML]
[cs.CL, cs.AI, cs.LG, cs.NE, stat.ML]
[cs.CL, cs.AI, cs.LG, cs.NE, stat.ML]
[stat.ML, cs.AI, cs.CL, cs.LG, cs.NE]
[cs.CL, cs.AI, cs.LG, cs.NE, stat.ML]
When I run
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(labels)
print(mlb.classes_)
it gives me:
array([' ', ',', '.', 'A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'L', 'M',
'N', 'O', 'P', 'R', 'S', 'T', 'V', 'Y', '[', ']', 'a', 'c', 'h',
'm', 's', 't'], dtype=object)
I (partially) fixed this problem by calling mlb.fit([y_train]), and I got (printing the first 10 classes):
array(['[cs.AI, cs.CC]', '[cs.AI, cs.CV]', '[cs.AI, cs.CY]',
'[cs.AI, cs.DB]', '[cs.AI, cs.DS]', '[cs.AI, cs.GT]',
'[cs.AI, cs.HC]', '[cs.AI, cs.IR]', '[cs.AI, cs.LG, stat.ML]',
'[cs.AI, cs.LG]'], dtype=object)
Ideally, it should output the individual classes (there may be something wrong in my code). When I use mlb.fit_transform([y_train]), I get:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Help would be very much appreciated.
Upvotes: 2
Views: 1849
Reputation: 496
This solved the problem:
First, you need to get rid of brackets/quotation marks:
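# regex=False treats "[" and "]" as literal characters rather than regex syntax
data['labels'] = data['labels'].str.replace("[", "", regex=False)
data['labels'] = data['labels'].str.replace("]", "", regex=False)
data['labels'] = data['labels'].str.replace("'", "", regex=False)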
After that, split each string on the commas so that every row becomes a list of labels:
from sklearn.preprocessing import MultiLabelBinarizer
one_hot = MultiLabelBinarizer()
y_classes = one_hot.fit_transform(data['labels'].str.split(', '))
Now you have the encoded labels. My problem was very similar to yours, and this fixed it.
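If the labels are plain Python strings rather than a pandas column, the same clean-up can be sketched without pandas. This is only an illustration: parse_label_string is a hypothetical helper name, the sample strings are made up, and it assumes that no label contains a comma, bracket, or quote character.

```python
from sklearn.preprocessing import MultiLabelBinarizer


def parse_label_string(raw):
    """Turn a string such as "[cs.AI, cs.CL]" into ['cs.AI', 'cs.CL'].

    Hypothetical helper: assumes labels never contain commas, brackets or quotes.
    """
    inner = raw.strip().strip("[]")
    # Strip whitespace and quote characters around each label; skip empty pieces.
    parts = (part.strip().strip("'\"") for part in inner.split(","))
    return [part for part in parts if part]


# Made-up examples covering unquoted, quoted and empty label strings.
raw_labels = ["[cs.AI, cs.CL, cs.CV]", "['cs.CL', 'stat.ML']", "[]"]
parsed = [parse_label_string(raw) for raw in raw_labels]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(parsed)   # one row per sample, one column per label
classes = list(mlb.classes_)    # column order of y
```

A sample with no labels ends up as a row of zeros, which may or may not be what you want for training.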
Upvotes: 2
Reputation: 1420
It was not working because your labels are not in the format MultiLabelBinarizer expects, which is a list of lists. In your example you have 5 rows of data, each with 5 labels, so your labels should be a list of 5 label lists (one per row of data), each containing that row's 5 labels as separate strings:
y_train= [['cs.AI', 'cs.CL', 'cs.CV', 'cs.NE', 'stat.ML'],
['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML'],
['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML'],
['stat.ML', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.NE'],
['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML']]
Then binarize the labels using sklearn:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_label_binarized = mlb.fit_transform(y_train)
and you will get your desired output, where y_label_binarized is:
array([[1, 1, 1, 0, 1, 1],
[1, 1, 0, 1, 1, 1],
[1, 1, 0, 1, 1, 1],
[1, 1, 0, 1, 1, 1],
[1, 1, 0, 1, 1, 1]])
and mlb.classes_ is:
array(['cs.AI', 'cs.CL', 'cs.CV', 'cs.LG', 'cs.NE', 'stat.ML'],
dtype=object)
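To check which column belongs to which label, one option is to build a lookup from mlb.classes_ and to round-trip the matrix through inverse_transform. The following is a small sketch using the first two rows from above; column_of and recovered are just illustrative names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y_train = [['cs.AI', 'cs.CL', 'cs.CV', 'cs.NE', 'stat.ML'],
           ['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML']]

mlb = MultiLabelBinarizer()
y_bin = mlb.fit_transform(y_train)

# Columns of y_bin follow the (sorted) order of mlb.classes_.
column_of = {label: i for i, label in enumerate(mlb.classes_)}

# inverse_transform maps each binary row back to a tuple of labels.
recovered = mlb.inverse_transform(y_bin)
```

Note that the recovered tuples come back in the order of mlb.classes_, not in the order the labels were originally written.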
Upvotes: 4
Reputation: 10786
Because when you call mlb.fit([y_train]), you aren't passing your training data to MultiLabelBinarizer - you're passing in a list with a single element, that element being your entire training data. This explains why your classes are weird - each "class" is one of your "samples":
array(['[cs.AI, cs.CC]', '[cs.AI, cs.CV]', '[cs.AI, cs.CY]',
'[cs.AI, cs.DB]', '[cs.AI, cs.DS]', '[cs.AI, cs.GT]',
'[cs.AI, cs.HC]', '[cs.AI, cs.IR]', '[cs.AI, cs.LG, stat.ML]',
'[cs.AI, cs.LG]'], dtype=object)
It also explains why, when you call fit_transform, you get a single row with every value set to 1 - you passed it one sample, which contained [cs.AI, cs.CC] and [cs.AI, cs.CV] and so on. It interpreted each of those as a class, and your overall data set as a single sample.
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Instead, call the methods with just y_train (where each element is itself a list of label strings, not one long string):
mlb.fit(y_train)
mlb.fit_transform(y_train)
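The three behaviours seen in this thread can be reproduced side by side. This is a minimal sketch with two made-up samples; the variable names are only for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

as_strings = ["[cs.AI, cs.CC]", "[cs.AI, cs.CV]"]     # each sample is one string
as_lists = [["cs.AI", "cs.CC"], ["cs.AI", "cs.CV"]]   # each sample is a list of labels

# A string is iterated character by character, so every class is a single character.
char_classes = list(MultiLabelBinarizer().fit(as_strings).classes_)

# Wrapping the data in another list gives ONE sample whose "labels" are whole strings.
wrapped = MultiLabelBinarizer().fit_transform([as_strings])

# A list of label lists gives one row per sample and one column per label.
mlb = MultiLabelBinarizer()
proper = mlb.fit_transform(as_lists)
```

Only the last form produces one column per actual label, so if y_train holds strings, they need to be split into lists before fitting.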
Upvotes: 1