Alejandro A
Alejandro A

Reputation: 1190

Transpose based on column values

I have a dataframe with cars and its colors:

colors_df = pd.DataFrame({'id' : [1,2,3], 'Color_Red':[0,1,0],'Color_Blue':[1,0,0],'Color_Green':[1,1,1]})

| id | Color_Red | Color_Blue | Color_Green |
| 1  |.    0.    |     1      |      1      |
| 2  |.    1.    |     0      |      1      |  
| 3  |.    0.    |     0      |      1      |

I want to create an extra column for every color named Color_XXX_Only and be set to value 1 if an id has only the one 1 in it's colors, and additionally a MultiColor column if it has more than one color set to 1. Expected output:

| id | Color_Red | Color_Blue | Color_Green |   Only_Red | Only_Blue | Only_Green | MultiColor|
| 1  |.    0.    |     1      |      1      |.     0.    |      0    |    0       |       1.  |      
| 2  |.    1.    |     0      |      1      |.     0.    |      0    |    0       |       1.  |  
| 3  |.    0.    |     0      |      1      |.     0.    |      0    |    1       |       0   |

1 is multi color because it is blue and green.

2 is multi color because its red and green.

3 is Onlye_blue and multi color because its only green.

Upvotes: 0

Views: 194

Answers (2)

chulo
chulo

Reputation: 63

This approach uses masks and loc

#build masks
m_red=(colors_df.Color_Red==1)
m_blue=(colors_df.Color_Blue==1)
m_green=(colors_df.Color_Green==1)
m_multi=(colors_df.Color_Red+colors_df.Color_Blue+colors_df.Color_Green>1)

#build new columns using loc
colors_df.loc[m_red,'Only_Red']=1
colors_df.loc[m_blue,'Only_Blue']=1
colors_df.loc[m_green,'Only_Green']=1
colors_df.loc[m_multi,'Multi']=1

#use fillna to replace NaN with 0
colors_df.fillna(0)

Upvotes: 0

Henry Ecker
Henry Ecker

Reputation: 35626

One approach (assuming each id is contained in a single row and does not span multiple rows) would be to calculate the number of 1's per row:

  1. filter to get only the Color_ columns
  2. sum across the rows to get the total number of 1s then compare to gt 1 (intuition is that rows with more than a single 1 value are Multicolor).
colors_cols = colors_df.filter(like='Color_')
multi_color = colors_cols.sum(axis=1).gt(1)

multi_color:

0     True
1     True
2    False
dtype: bool

Mask out the Multicolor rows and update the column names:

  1. mask rows that are MultiColor and set the values to 0.
  2. Update the column names with str.replace replacing the word Color_ with Only_
colors_cols = colors_cols.mask(multi_color, 0)
colors_cols.columns = (
    colors_cols.columns.str.replace(r'Color_(.*)', r'Only_\1', regex=True)
)
   Only_Red  Only_Blue  Only_Green
0         0          0           0
1         0          0           0
2         0          0           1

Then join back and add Multicolor column:

  1. join the above dataframe back to colors_df to add the 3 new columns.
  2. Turn multi_color which indicates Multicolor rows from bool to int with astype
colors_df = colors_df.join(colors_cols)
colors_df['MultiColor'] = multi_color.astype(int)

colors_df:

   id  Color_Red  Color_Blue  Color_Green  Only_Red  Only_Blue  Only_Green  MultiColor
0   1          0           1            1         0          0           0           1
1   2          1           0            1         0          0           0           1
2   3          0           0            1         0          0           1           0

Complete Working Example:

import pandas as pd

colors_df = pd.DataFrame({
    'id': [1, 2, 3],
    'Color_Red': [0, 1, 0],
    'Color_Blue': [1, 0, 0],
    'Color_Green': [1, 1, 1]
})

# Create Necessary Boolean Information
colors_cols = colors_df.filter(like='Color_')
multi_color = colors_cols.sum(axis=1).gt(1)

# Build Only_ columns
colors_cols = colors_cols.mask(multi_color, 0)
colors_cols.columns = (
    colors_cols.columns.str.replace(r'Color_(.*)', r'Only_\1', regex=True)
)

# Add Columns to `colors_df`
colors_df = colors_df.join(colors_cols)
colors_df['MultiColor'] = multi_color.astype(int)

# Display
print(colors_df.to_string())

Upvotes: 3

Related Questions