Reputation: 6224
I am trying to do multilabel classification, but I am stuck at data preprocessing. My target data is in a separate file and looks like this:
Id Tag
0 1 data
1 4 c#
2 4 winforms
3 4 type-conversion
4 4 decimal
I am trying to use MultiLabelBinarizer to preprocess the data. At the end of it, I want it to look something like this -
ID | data | c# | winforms | type-conversion | decimal |
---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 1 | 1 | 1 | 1 |
This is the code that I am using
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
question_tags = pd.read_csv("./archive/question_tags.csv")
print(question_tags.head())
mlb = MultiLabelBinarizer()
print(mlb.fit_transform(question_tags))
This is the output I am getting.
[[1 0 0 1 0]
[0 1 1 0 1]]
What am I doing wrong?
Upvotes: 1
Reputation: 2122
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
data = pd.DataFrame({
'id': [1, 4, 4, 4, 4],
'tag': ['data', 'c#', 'winforms', 'type-conversion', 'decimal']
})
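# Collect all tags that belong to the same id into one list per id
gr = data.groupby('id')['tag'].apply(list).reset_index()
print(gr)
'''
   id                                       tag
0   1                                    [data]
1   4  [c#, winforms, type-conversion, decimal]
'''
# Fit on the per-id tag lists; each list is treated as one sample
mlb = MultiLabelBinarizer()
tag_binarized = mlb.fit_transform(gr['tag'])
# mlb.classes_ holds the tag names in sorted order, matching the output columns
tags_df = pd.DataFrame(tag_binarized, columns=mlb.classes_)
print(tags_df)
'''
   c#  data  decimal  type-conversion  winforms
0   0     1        0                0         0
1   1     0        1                1         1
'''
# Put the ids back next to the indicator columns (both frames share the default 0..n-1 index)
res = pd.concat([gr['id'], tags_df], axis=1)
print(res)
'''
   id  c#  data  decimal  type-conversion  winforms
0   1   0     1        0                0         0
1   4   1     0        1                1         1
'''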
Upvotes: 0
Reputation: 150735
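MultiLabelBinarizer(), like most of sklearn, returns NumPy arrays, so the result never carries the Id and Tag names. Note, however, that it expects an iterable of label collections, one per sample. Iterating over a DataFrame yields its column names, so the two rows in your output encode the characters of the strings "Id" and "Tag" rather than your two IDs, and the tags are never grouped per Id.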
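To see where the two rows in the question's output come from, here is a small sketch. It assumes a stand-in DataFrame with the same Id and Tag columns as the question's file (the real CSV is not reproduced here). Because iterating over a DataFrame yields its column labels, MultiLabelBinarizer receives the strings "Id" and "Tag" as its two samples and treats their characters as labels.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed stand-in for ./archive/question_tags.csv, using the rows shown in the question
question_tags = pd.DataFrame({
    "Id": [1, 4, 4, 4, 4],
    "Tag": ["data", "c#", "winforms", "type-conversion", "decimal"],
})

mlb = MultiLabelBinarizer()
out = mlb.fit_transform(question_tags)

# Iterating a DataFrame gives its column labels, not its rows
print(list(question_tags))   # ['Id', 'Tag']
# The "labels" are the characters of those two strings, sorted
print(list(mlb.classes_))    # ['I', 'T', 'a', 'd', 'g']
# One row per column name: "Id" sets I and d, "Tag" sets T, a and g
print(out)
```

The array printed last matches the output shown in the question, which suggests the tags themselves were never seen by the binarizer.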
Use pd.crosstab instead:
pd.crosstab(df['Id'], df['Tag'])
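As a runnable sketch of the crosstab approach, assuming `df` holds the five sample rows from the question (the variable name and the inline data are stand-ins for the real CSV): `pd.crosstab` counts occurrences, so the result is clipped to 0/1 in case an (Id, Tag) pair appears more than once, and the columns are reordered by first appearance to match the layout asked for.

```python
import pandas as pd

# Assumed stand-in for the question's file
df = pd.DataFrame({
    "Id": [1, 4, 4, 4, 4],
    "Tag": ["data", "c#", "winforms", "type-conversion", "decimal"],
})

# Rows are Ids, columns are Tags, cells are occurrence counts
counts = pd.crosstab(df["Id"], df["Tag"])

# Turn counts into 0/1 indicators (only matters if a pair repeats)
indicator = counts.clip(upper=1)

# crosstab sorts the columns; restore the order in which tags first appear
indicator = indicator.reindex(columns=df["Tag"].unique())

# Drop the "Tag" axis label and turn the Id index back into a column
indicator.columns.name = None
result = indicator.reset_index()
print(result)
```

If alphabetical column order is acceptable, the `reindex` step can be dropped.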
Upvotes: 1