create binary columns in a dataframe from condition on its value

Question

I have a dataframe that looks like this one:

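import numpy as np
import pandas as pd

# object dtype so the columns can hold both strings and NaN
df = pd.DataFrame(np.nan, index=[0,1,2,3], columns=['A','B','C'], dtype=object)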
df.iloc[0,0] = 'a'
df.iloc[1,0] = 'b'
df.iloc[1,1] = 'c'
df.iloc[2,0] = 'b'
df.iloc[3,0] = 'c'
df.iloc[3,1] = 'b'
df.iloc[3,2] = 'd'
df

out :   A   B   C
   0    a   NaN NaN
   1    b   c   NaN
   2    b   NaN NaN
   3    c   b   d

I would like to add new columns to it whose names are the values inside the dataframe (here 'a', 'b', 'c', and 'd'). Those columns are binary and indicate whether each of the values 'a', 'b', 'c', and 'd' is present in the row.

In one picture, the output I'd like is:

        A   B   C    a   b   c   d
   0    a   NaN NaN  1   0   0   0
   1    b   c   NaN  0   1   1   0
   2    b   NaN NaN  0   1   0   0
   3    c   b   d    0   1   1   1

To do this I first create the columns filled with zeros:

cols = pd.Series(df.values.ravel()).value_counts().index
for col in cols:
    df[col] = 0

(It doesn't create the columns in the right order, but that doesn't matter)

Then I...use a loop over the rows and columns...

for row in df.index:
    for col in cols:
        if col in df.loc[row].values:
            df.loc[row,col] = 1

You can see why I'm looking for another way to do it: even though my dataframe is relatively small (76k rows), this still takes around 8 minutes, which is far too long.

Any idea?

IanS · Accepted Answer

You're looking for get_dummies. Here I choose to use the .str version, which splits each string on '|' by default:

df.fillna('', inplace=True)
(df.A + '|' + df.B + '|'  + df.C).str.get_dummies()

Output:

   a  b  c  d
0  1  0  0  0
1  0  1  1  0
2  0  1  0  0
3  0  1  1  1
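To get the exact layout asked for in the question (original columns followed by the indicator columns), the result can be joined back onto the frame. The following is only a sketch: it rebuilds the example data from the question, and the names `joined`, `dummies` and `result` are chosen here for illustration. It fills a copy rather than using `inplace=True`, so the NaN in the original columns are kept, and it joins the row values with `apply` so the column names do not have to be listed by hand.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {
        "A": ["a", "b", "b", "c"],
        "B": [np.nan, "c", np.nan, "b"],
        "C": [np.nan, np.nan, np.nan, "d"],
    }
)

# Join each row's values with '|' (the default separator of str.get_dummies).
# fillna returns a new frame here, so the NaN in df are left untouched.
joined = df.fillna("").apply("|".join, axis=1)

# One 0/1 column per distinct value; the empty strings produce no column.
dummies = joined.str.get_dummies()

# Append the indicator columns to the original frame.
result = df.join(dummies)
print(result)
```

If some of the value columns are not strings, they would need to be converted first (for example with `astype(str)` after filling), which is an assumption beyond the data shown in the question.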
