CodeBabe

Reputation: 13

How to create new columns from existing columns with get_dummies

I have this DataFrame:

    column1   column2   column3   column4
0     A          A         D         D
1     B          B         D         D 
2     C          C         B         D
3     A          D         D         A 

I want to add the categories found in column1 through column4 as new columns of the DataFrame and fill them with 1s and 0s, like this:

    column1   column2   column3   column4   A     B     C     D
0     A          A         D         D      1     0     0     1
1     B          B         D         D      0     1     0     1
2     C          C         B         D      0     1     1     1
3     A          D         D         A      1     0     0     1

So I tried this code:

pd.concat([df, df['column1'].str.get_dummies(sep=',')], axis=1)

This gives me the 1s and 0s for column1 only. How can I modify my code to get the 1s and 0s for all four columns? The per-column values should be combined with a logical OR:

0 | 0 = 0
0 | 1 = 1
1 | 0 = 1
1 | 1 = 1

I also tried:

df1 = df.column1.str.get_dummies(sep=',')
df2 = df.column2.str.get_dummies(sep=',') 
df3 = df.column3.str.get_dummies(sep=',') 
df4 = df.column4.str.get_dummies(sep=',') 
frames = [df1, df2, df3, df4]
result = pd.concat(frames, sort=True)

But I want each category to occur only once as a new column, and the value 1 should mean that the category occurs at least once in that row. Can you please help me? :)

Upvotes: 1

Views: 204

Answers (1)

N. Tarou

Reputation: 36

The str.get_dummies method derives the categories from the values of the Series it is called on, so that Series (a column in your case) must contain every category you want to obtain. In other words, you need a single column that holds the values of all 4 columns joined with a separator. To put the values together we use:

new_col = df[['column1', 'column2', 'column3', 'column4']].apply(lambda x: '|'.join(x), axis=1)

This joins the column values of each row into a single string with "|" as the separator, giving this Series:

0    A|A|D|D
1    B|B|D|D
2    C|C|B|D
3    A|D|D|A
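
To try the join step on its own, here is a minimal sketch. The frame is retyped from the table in the question, and the names cols and new_col are only illustrative. The astype(str) call is an added safeguard that is not part of the line above: str.join raises a TypeError on non-string cells, but note that it would also turn a missing value into the literal string "nan", which would then show up as its own category.

```python
import pandas as pd

# Sample frame retyped from the question.
df = pd.DataFrame({
    "column1": ["A", "B", "C", "A"],
    "column2": ["A", "B", "C", "D"],
    "column3": ["D", "D", "B", "D"],
    "column4": ["D", "D", "D", "A"],
})
cols = ["column1", "column2", "column3", "column4"]

# Cast to str first (added safeguard), then join the cells of each row with "|".
new_col = df[cols].astype(str).apply(lambda row: "|".join(row), axis=1)

print(new_col.tolist())
# ['A|A|D|D', 'B|B|D|D', 'C|C|B|D', 'A|D|D|A']
```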

Now we just apply .str.get_dummies(sep='|') to that Series to obtain the dummies for all categories present in those columns. It can be written as a single statement (I also concatenate the result with the original DataFrame to get the format you asked for):

df = pd.concat([df, df[['column1', 'column2', 'column3', 'column4']]
       .apply(lambda x: '|'.join(x), axis=1)
       .str
       .get_dummies(sep='|')], axis=1)
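
For reference, a self-contained sketch of the whole thing, assuming the sample data from the question. It also shows one possible alternative that is not part of the approach above: stack the cells into one long Series, one-hot encode it with pd.get_dummies, and take the maximum per original row, which acts as the logical OR the question asks for. The names joined, stacked and result are only illustrative.

```python
import pandas as pd

# Sample frame retyped from the question.
df = pd.DataFrame({
    "column1": ["A", "B", "C", "A"],
    "column2": ["A", "B", "C", "D"],
    "column3": ["D", "D", "B", "D"],
    "column4": ["D", "D", "D", "A"],
})
cols = ["column1", "column2", "column3", "column4"]

# Approach from this answer: join each row, then split into indicator columns.
joined = df[cols].apply(lambda row: "|".join(row), axis=1).str.get_dummies(sep="|")

# Alternative (an assumption, not from the answer): one entry per cell,
# one-hot encode, then OR the entries of each original row via max().
stacked = pd.get_dummies(df[cols].stack()).groupby(level=0).max().astype(int)

# Both give the same 0/1 table; attach it to the original columns.
result = pd.concat([df, joined], axis=1)
print(result)
```

The stacked variant avoids building and re-splitting strings, so it should also work when a value happens to contain the separator character.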

Upvotes: 1
