Subhanandh

Reputation: 151

Creating a dataframe with binary valued columns with pandas using values from an existing dataframe

I am trying to create a new dataframe with binary (0 or 1) values from an existing dataframe. For every row in the given dataframe, the program should take the value from each cell and set 1 in the corresponding column of the row with the same index in the new dataframe.

Structure of the existing dataframe

Structure of the new dataframe

I have tried executing the following code snippet.

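for col in products:  # iterating a dataframe yields its column labels
    # assumes products_coded has the same default 0..n-1 row index as products
    for index, item in enumerate(products[col]):
        # .ix was removed in pandas 1.0; .loc does the label-based assignment
        products_coded.loc[index, 'prod_' + str(item)] = 1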

It works for a small number of rows, but it takes a long time on any large dataset. What would be the best way to get the desired outcome?

Upvotes: 1

Views: 1305

Answers (1)

jezrael

Reputation: 863611

I think you need:

  • first use get_dummies after casting the values to strings
  • aggregate with max over the duplicated column names
  • convert the column names to int for correct ordering
  • reindex to order the columns and add the missing ones, replacing NaNs with 0 via the parameter fill_value=0; this also removes the first 0 column
  • use add_prefix to rename the columns

import pandas as pd

df = pd.DataFrame({'B':[3,1,12,12,8],
                   'C':[0,6,0,14,0],
                   'D':[0,14,0,0,0]})

print (df)
    B   C   D
0   3   0   0
1   1   6  14
2  12   0   0
3  12  14   0
4   8   0   0

df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='', dtype=int)
         .T.groupby(level=0).max().T  # replaces max(level=0, axis=1), removed in pandas 2.0
         .rename(columns=lambda x: int(x))
         .reindex(columns=range(1, df.values.max() + 1), fill_value=0)
         .add_prefix('prod_'))
print (df1)

   prod_1  prod_2  prod_3  prod_4  prod_5  prod_6  prod_7  prod_8  prod_9  \
0       0       0       1       0       0       0       0       0       0   
1       1       0       0       0       0       1       0       0       0   
2       0       0       0       0       0       0       0       0       0   
3       0       0       0       0       0       0       0       0       0   
4       0       0       0       0       0       0       0       1       0   

   prod_10  prod_11  prod_12  prod_13  prod_14  
0        0        0        0        0        0  
1        0        0        0        0        1  
2        0        0        1        0        0  
3        0        0        1        0        1  
4        0        0        0        0        0 
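
To see what each step in the list contributes, the chain can be split into intermediate variables. This is only a sketch on the same sample data; the names dummies, merged and full are mine, and it assumes a pandas version where get_dummies accepts dtype and where the transposed groupby is used in place of max(level=0, axis=1).

```python
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 12, 12, 8],
                   'C': [0, 6, 0, 14, 0],
                   'D': [0, 14, 0, 0, 0]})

# Step 1: one indicator column per (source column, value) pair.
# With an empty prefix and separator the labels are just the value strings,
# so the same label can appear more than once (e.g. '14' from C and from D).
dummies = pd.get_dummies(df.astype(str), prefix='', prefix_sep='', dtype=int)

# Step 2: merge columns that share a label by taking the max,
# done on the transpose so the duplicated labels become index values.
merged = dummies.T.groupby(level=0).max().T

# Step 3: integer labels, so that 2 sorts before 12.
merged = merged.rename(columns=int)

# Step 4: keep exactly the columns 1..max; this drops column 0
# and adds all-zero columns for values that never occur.
full = merged.reindex(columns=range(1, df.values.max() + 1), fill_value=0)

# Step 5: final column names.
df1 = full.add_prefix('prod_')
print(df1)
```

Printing dummies.columns after step 1 shows the duplicated labels, which is why the max aggregation is needed before the reindex.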

Another similar solution:

df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='', dtype=int)
         .T.groupby(level=0).max().T)  # replaces max(level=0, axis=1), removed in pandas 2.0

df1.columns = df1.columns.astype(int)    

df1 = (df1.reindex(columns=range(1, df1.columns.max() + 1), fill_value=0)
          .add_prefix('prod_'))
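
If the dataset is very large, a plain NumPy version may also be worth trying. This is a different technique from get_dummies, shown only as a sketch: it assumes every cell holds a non-negative integer code and that 0 means "no product", as in the sample data. The names coded, row_idx and df2 are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 12, 12, 8],
                   'C': [0, 6, 0, 14, 0],
                   'D': [0, 14, 0, 0, 0]})

values = df.to_numpy()
n_rows, n_cols = values.shape
max_code = int(values.max())

# One column per code 0..max_code; column 0 collects the zeros and is dropped later.
coded = np.zeros((n_rows, max_code + 1), dtype=np.int8)

# Row number for every cell, in the same row-major order as values.ravel().
row_idx = np.repeat(np.arange(n_rows), n_cols)
coded[row_idx, values.ravel()] = 1

# Drop the column for code 0 and label the rest prod_1..prod_max.
df2 = pd.DataFrame(coded[:, 1:],
                   index=df.index,
                   columns=['prod_' + str(i) for i in range(1, max_code + 1)])
print(df2)
```

This avoids the string conversion entirely, at the cost of allocating the full rows x (max value + 1) array up front, so it only makes sense when the largest code is reasonably small.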

Upvotes: 1
