Lko

Reputation: 252

MultiLabelBinarizer outputs classes as letters instead of categories

I have a dataframe where one column is short_names. Each value consists of names of 2-5 letters, e.g. BG, OP, LE, WEL, LC. Each row can have any number of names.

I am trying to use MultiLabelBinarizer to convert the names into individual columns, so that a row has a 1 in the column of every name it contains:

from sklearn.preprocessing import MultiLabelBinarizer

one_hot = MultiLabelBinarizer()
one_hot.fit_transform(df['short_names'])
one_hot.classes_

One of the rows contains a '-', which results in the error TypeError: 'float' object is not iterable, so I have used

df['short_names']= df['short_names'].astype(str)

The issue now is that the classes are output as single letters instead of the short names, i.e. A, B, C instead of BG, OP.

Upvotes: 3

Views: 5037

Answers (1)

jezrael

Reputation: 862441

I think you need dropna to remove the missing values, with str.split if necessary so that each row becomes a list of names:

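import numpy as np
import pandas as pd

# Parentheses are needed so the chained call can continue on the next line
df = (pd.Series({0: np.nan, 1: 'CE', 2: 'NPP', 4: 'SE, CB, CBN, OOM, BCI', 5: 'RCS'})
        .to_frame('short_name'))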
print (df)
              short_name
0                    NaN
1                     CE
2                    NPP
4  SE, CB, CBN, OOM, BCI
5                    RCS

from sklearn.preprocessing import MultiLabelBinarizer
one_hot = MultiLabelBinarizer()
a = one_hot.fit_transform(df['short_name'].dropna().str.split(', ')) 
print (a)
[[0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [1 1 1 0 0 1 0 1]
 [0 0 0 0 0 0 1 0]]

print(one_hot.classes_ )
['BCI' 'CB' 'CBN' 'CE' 'NPP' 'OOM' 'RCS' 'SE']

If you want the output as a DataFrame:

# Use a new name so the original df stays available for the next solution
df1 = pd.DataFrame(a, columns=one_hot.classes_)
print (df1)
   BCI  CB  CBN  CE  NPP  OOM  RCS  SE
0    0   0    0   1    0    0    0   0
1    0   0    0   0    1    0    0   0
2    1   1    1   0    0    1    0   1
3    0   0    0   0    0    0    1   0
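
Note that the frame above gets a fresh 0..3 index, so its rows no longer line up with the original data. A minimal sketch, assuming you want to keep the original row labels and join the indicator columns back (the names `labels`, `encoded` and `result` are only illustrative, not from the question):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.Series(
    {0: np.nan, 1: 'CE', 2: 'NPP', 4: 'SE, CB, CBN, OOM, BCI', 5: 'RCS'}
).to_frame('short_name')

# Drop missing rows, then split each string into a list of labels.
labels = df['short_name'].dropna().str.split(', ')

one_hot = MultiLabelBinarizer()
encoded = pd.DataFrame(
    one_hot.fit_transform(labels),
    columns=one_hot.classes_,
    index=labels.index,  # keep the original row labels 1, 2, 4, 5
)

# Join back onto the original frame; the row that was dropped (index 0)
# gets NaN in every indicator column.
result = df.join(encoded)
print(result)
```

If you would rather have zeros than NaN for the dropped row, calling fillna(0) on the joined indicator columns is one option.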

Another solution is to replace the missing values with fillna:

from sklearn.preprocessing import MultiLabelBinarizer
one_hot = MultiLabelBinarizer()
a = one_hot.fit_transform(df['short_name'].fillna('missing').str.split(', ')) 
print (a)
[[0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [1 1 1 0 0 1 0 1 0]
 [0 0 0 0 0 0 1 0 0]]

print(one_hot.classes_ )
['BCI' 'CB' 'CBN' 'CE' 'NPP' 'OOM' 'RCS' 'SE' 'missing']

df = pd.DataFrame(a, columns=one_hot.classes_ )
print (df)
   BCI  CB  CBN  CE  NPP  OOM  RCS  SE  missing
0    0   0    0   0    0    0    0   0        1
1    0   0    0   1    0    0    0   0        0
2    0   0    0   0    1    0    0   0        0
3    1   1    1   0    0    1    0   1        0
4    0   0    0   0    0    0    1   0        0
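
As for the letters in the question: MultiLabelBinarizer iterates over each row, and iterating over a plain string yields single characters, which is what happens after astype(str). A small sketch with made-up rows built from the question's example names (`as_strings` and `as_lists` are just illustrative variables):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Plain strings: each row is iterated character by character,
# so the classes are single characters (including ',' and ' ').
as_strings = ['BG', 'OP, LE']
mlb_chars = MultiLabelBinarizer().fit(as_strings)
print(mlb_chars.classes_)

# Lists of labels: each row is iterated label by label.
as_lists = [s.split(', ') for s in as_strings]
mlb_labels = MultiLabelBinarizer().fit(as_lists)
print(mlb_labels.classes_)  # ['BG' 'LE' 'OP']
```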

Upvotes: 3
