Reputation: 721
I have a column with categories (A, B, C, D) I want to turn into dummy variables. Problem is, this column can contain multiple categories per row, like this:
DF = pd.DataFrame({'Col':['A', 'A, B', 'A, C', 'B, C, D', 'D']})
Col
0 A
1 A, B
2 A, C
3 B, C, D
4 D
My thought at this point is to first split the variable into multiple fields using ',' as the delimiter, then dummy-code the results. Something like this:
DF2 = DF['Col'].str.split(', ', expand = True)
0 1 2
0 A None None
1 A B None
2 A C None
3 B C D
4 D None None
pd.get_dummies(DF2)
0_A 0_B 0_D 1_B 1_C 2_D
0 1 0 0 0 0 0
1 1 0 0 1 0 0
2 1 0 0 0 1 0
3 0 1 0 0 1 1
4 0 0 1 0 0 0
Finally, run some sort of loop through across the columns to create a single set of dummy variables for A, B, C, and D. This can work, but gets quite tedious with many more variables/categories. Is there an easier way to achieve this?
Upvotes: 1
Views: 636
Reputation: 323226
By using pd.crosstab
import pandas as pd
df = pd.DataFrame({'Col':['A', 'A,B', 'A,C', 'B,C,D', 'D']})
df.Col=df.Col.str.split(',')
df1=df.Col.apply(pd.Series).stack()
pd.crosstab(df1.index.get_level_values(0),df1)
Out[893]:
col_0 A B C D
row_0
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
Upvotes: 1
Reputation: 294258
Simplest way is
DF.Col.str.get_dummies(', ')
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
Slightly more complicated
from sklearn.preprocessing import MultiLabelBinarizer
from numpy.core.defchararray import split
mlb = MultiLabelBinarizer()
s = DF.Col.values.astype(str)
d = mlb.fit_transform(split(s, ', '))
pd.DataFrame(d, columns=mlb.classes_)
A B C D
0 1 0 0 0
1 1 1 0 0
2 1 0 1 0
3 0 1 1 1
4 0 0 0 1
Upvotes: 6