odbhut.shei.chhele

Reputation: 6224

How to use MultiLabelBinarizer for Multilabel classification?

I am trying to do multilabel classification, but I am stuck at data preprocessing. My target data is in a separate file and looks like this:

   Id              Tag
0   1             data
1   4               c#
2   4         winforms
3   4  type-conversion
4   4          decimal

I am trying to use MultiLabelBinarizer to preprocess the data. In the end, I want one row per Id and one 0/1 column per tag, like this:

ID  data  c#  winforms  type-conversion  decimal
 1     1   0         0                0        0
 4     0   1         1                1        1

This is the code that I am using:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

question_tags = pd.read_csv("./archive/question_tags.csv")
print(question_tags.head())
mlb = MultiLabelBinarizer()
print(mlb.fit_transform(question_tags))

This is the output I am getting:

[[1 0 0 1 0]
 [0 1 1 0 1]]

What am I doing wrong?

Upvotes: 1

Views: 1124

Answers (2)

Soudipta Dutta

Reputation: 2122

Group the tags into one list per id first, then pass those lists to MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = pd.DataFrame({
    'id': [1, 4, 4, 4, 4],
    'tag': ['data', 'c#', 'winforms', 'type-conversion', 'decimal']
})

# Collect all tags that belong to the same id into a single list
gr = data.groupby('id')['tag'].apply(list).reset_index()
print(gr)
'''
   id                                       tag
0   1                                    [data]
1   4  [c#, winforms, type-conversion, decimal]
'''
# Each list becomes one row of 0/1 indicators; mlb.classes_ holds the column labels
mlb = MultiLabelBinarizer()
tag_binarized = mlb.fit_transform(gr['tag'])
tags_df = pd.DataFrame(tag_binarized, columns=mlb.classes_)
print(tags_df)
'''
   c#  data  decimal  type-conversion  winforms
0   0     1        0                0         0
1   1     0        1                1         1
'''
# Put the id column back in front of the indicator columns
res = pd.concat([gr['id'], tags_df], axis=1)
print(res)
'''
   id  c#  data  decimal  type-conversion  winforms
0   1   0     1        0                0         0
1   4   1     0        1                1         1
'''
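The columns above come out in sorted order, while the table in the question lists them as data, c#, winforms, type-conversion, decimal. If that order matters, one option is the `classes` parameter of MultiLabelBinarizer, which fixes the label ordering. A minimal sketch, assuming the same toy frame as above; `wanted_order` is a name introduced here for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = pd.DataFrame({
    "id": [1, 4, 4, 4, 4],
    "tag": ["data", "c#", "winforms", "type-conversion", "decimal"],
})
gr = data.groupby("id")["tag"].apply(list).reset_index()

# Column order copied from the expected table in the question (assumption:
# every tag in the data appears in this list).
wanted_order = ["data", "c#", "winforms", "type-conversion", "decimal"]

# Passing classes= keeps the given order instead of sorting the labels.
mlb = MultiLabelBinarizer(classes=wanted_order)
tags_df = pd.DataFrame(mlb.fit_transform(gr["tag"]), columns=mlb.classes_)

res = pd.concat([gr["id"], tags_df], axis=1)
print(res)
```

With a real tag file the list would have to cover every tag of interest, so this is mainly useful when the label set is known in advance.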

Upvotes: 0

Quang Hoang

Reputation: 150735

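The array you are getting is not built from your tags. MultiLabelBinarizer expects an iterable of label collections, one per sample, and iterating over a DataFrame yields its column names, so the two rows encode the characters of "Id" and "Tag" (classes I, T, a, d, g). On top of that, like most of scikit-learn, MultiLabelBinarizer returns a plain NumPy array, so the output carries no Id or tag names.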
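A minimal sketch of what the binarizer sees when it is handed the whole DataFrame. The frame below is a hand-built stand-in for the question's CSV, assumed from the five rows shown there:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Stand-in for question_tags.csv (assumed columns: Id, Tag).
question_tags = pd.DataFrame({
    "Id": [1, 4, 4, 4, 4],
    "Tag": ["data", "c#", "winforms", "type-conversion", "decimal"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
samples = list(question_tags)
print(samples)

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(question_tags)

# Each column label is a string, so it is split into single-character labels.
learned_classes = list(mlb.classes_)
print(learned_classes)
print(encoded)
```

Inspecting `mlb.classes_` after fitting is a quick way to confirm that the labels are the tags and not something else.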

Use pd.crosstab instead:

pd.crosstab(df['Id'], df['Tag'])
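For completeness, a runnable sketch of the crosstab approach, assuming `df` holds the five rows from the question. The `clip` and `reset_index` steps are optional additions, not part of the one-liner above:

```python
import pandas as pd

# Assumed stand-in for the question's data.
df = pd.DataFrame({
    "Id": [1, 4, 4, 4, 4],
    "Tag": ["data", "c#", "winforms", "type-conversion", "decimal"],
})

# Count (Id, Tag) pairs: one row per Id, one column per Tag.
indicator = pd.crosstab(df["Id"], df["Tag"])

# crosstab counts occurrences; clip to 0/1 in case a pair is repeated.
indicator = indicator.clip(upper=1)

# Turn the Id index back into an ordinary column.
result = indicator.reset_index()
result.columns.name = None
print(result)
```

This stays entirely in pandas, so the Id and tag labels are kept without rebuilding a DataFrame from a NumPy array.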

Upvotes: 1
