Pandas: Turn multiple variables into a single set of dummy variables

Question

I have a column with categories (A, B, C, D) I want to turn into dummy variables. Problem is, this column can contain multiple categories per row, like this:

DF = pd.DataFrame({'Col':['A', 'A, B', 'A, C', 'B, C, D', 'D']})

    Col
0   A
1   A, B
2   A, C
3   B, C, D
4   D

My thought at this point is to first split the variable into multiple fields using ',' as the delimiter, then dummy-code the results. Something like this:

DF2 = DF['Col'].str.split(', ', expand = True)

    0   1     2
0   A   None  None
1   A   B     None
2   A   C     None
3   B   C     D
4   D   None  None

pd.get_dummies(DF2)

    0_A 0_B 0_D 1_B 1_C 2_D
0   1   0   0   0   0   0
1   1   0   0   1   0   0
2   1   0   0   0   1   0
3   0   1   0   0   1   1
4   0   0   1   0   0   0

Finally, run some sort of loop through across the columns to create a single set of dummy variables for A, B, C, and D. This can work, but gets quite tedious with many more variables/categories. Is there an easier way to achieve this?

piRSquared · Accepted Answer

Simplest way is

DF.Col.str.get_dummies(', ')

   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  0  1  0
3  0  1  1  1
4  0  0  0  1

Slightly more complicated

from sklearn.preprocessing import MultiLabelBinarizer
from numpy.core.defchararray import split

mlb = MultiLabelBinarizer()
s = DF.Col.values.astype(str)
d = mlb.fit_transform(split(s, ', '))

pd.DataFrame(d, columns=mlb.classes_)

   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  0  1  0
3  0  1  1  1
4  0  0  0  1

Pandas: Turn multiple variables into a single set of dummy variables

Answers (2)

Related Questions