Add flags to DataFrame based on (grouped) conditions

Question

I'm working with a pandas DataFrame having the following structure:

import pandas as pd

df1 = pd.DataFrame({'left_name' : ['left_name1', 'left_name2', 'left_name3', 'left_name4', 'left_name5', 'right_name6', 'right_name7', 'right_name8'],
                    'right_name' : ['right_name1', 'right_name2', 'right_name2', 'right_name2', 'right_name3', 'right_name4', 'right_name4', 'right_name5'],
                    'score' : [0.98, 0.99, 0.97, 0.91, 1, 0.92, 0.90, 0.96]})

print(df1)

     left_name   right_name  score
0   left_name1  right_name1   0.98
1   left_name2  right_name2   0.99
2   left_name3  right_name2   0.97
3   left_name4  right_name2   0.91
4   left_name5  right_name3   1.00
5  right_name6  right_name4   0.92
6  right_name7  right_name4   0.90
7  right_name8  right_name5   0.96

I would like to add two new columns to the above table. Visually, the resulting DataFrame should look like this:

     left_name   right_name  score       col1  col2
0   left_name1  right_name1   0.98      MATCH     1
1   left_name2  right_name2   0.99  POTENTIAL     1
2   left_name3  right_name2   0.97  POTENTIAL     0
3   left_name4  right_name2   0.91  POTENTIAL     0
4   left_name5  right_name3   1.00      MATCH     1
5  right_name6  right_name4   0.92  POTENTIAL     1
6  right_name7  right_name4   0.90  POTENTIAL     0
7  right_name8  right_name5   0.96      MATCH     1

The rules to create the two new columns are the following:

For the col1 Series: right_names that occur only once are set to MATCH, and right_names that occur more than once are set to POTENTIAL.
For the col2 Series:
- MATCHes are flagged with 1;
- for each group of right_name, the POTENTIAL with the highest score is flagged with 1 (the rest are mapped to 0).

I'm having a hard time translating the above rules to Python/Pandas code. Any help on how to think and code this would be appreciated.

jezrael · Accepted Answer

If you need to flag the maximum score in each group, compare each row's score with its group maximum. A group with a single row is its own maximum, so that row is flagged as well:

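import numpy as np

# 1 where the row's score equals the maximum score of its right_name group
m = df1.groupby('right_name')['score'].transform('max').eq(df1['score']).astype(int)

# right_names that occur more than once are POTENTIAL, the rest are MATCH
df1['col1'] = np.where(df1['right_name'].duplicated(keep=False), 'POTENTIAL', 'MATCH')
df1['col2'] = np.where(m, 1, 0)
print(df1)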
     left_name   right_name  score       col1  col2
0   left_name1  right_name1   0.98      MATCH     1
1   left_name2  right_name2   0.99  POTENTIAL     1
2   left_name3  right_name2   0.97  POTENTIAL     0
3   left_name4  right_name2   0.91  POTENTIAL     0
4   left_name5  right_name3   1.00      MATCH     1
5  right_name6  right_name4   0.92  POTENTIAL     1
6  right_name7  right_name4   0.90  POTENTIAL     0
7  right_name8  right_name5   0.96      MATCH     1
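
One thing to be aware of with the comparison against transform('max'): if several rows in a group share the maximum score, all of them are flagged. The sketch below is a minimal illustration on a small made-up frame (the data and the column names col2_all_ties and col2_first_only are assumptions, not part of the question). It contrasts that behaviour with idxmax, which returns only the first maximum per group:

```python
import pandas as pd

# Hypothetical data with a tie for the maximum inside group "b".
df = pd.DataFrame({
    "right_name": ["a", "b", "b", "b"],
    "score": [0.98, 0.99, 0.99, 0.91],
})

# Comparison against the group maximum: every tied row is flagged.
all_max = df.groupby("right_name")["score"].transform("max").eq(df["score"])

# idxmax gives the index label of the first maximum in each group,
# so exactly one row per group is flagged.
first_max = df.index.isin(df.groupby("right_name")["score"].idxmax())

df["col2_all_ties"] = all_max.astype(int)
df["col2_first_only"] = first_max.astype(int)
print(df)
```

Which of the two is wanted depends on whether tied POTENTIAL rows should all be kept or only one of them.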

Alternatively, exclude the rows whose score is 1, take the maximum per group among the remaining rows, and then add the score-1 rows back with | (bitwise OR):

m = (df1[df1['score'].ne(1)]
       .groupby('right_name')['score'].transform('max')
       .eq(df1['score']).astype(int))

df1['col1'] = np.where(df1['right_name'].duplicated(keep=False),'POTENTIAL', 'MATCH')
df1['col2'] = np.where(m | df1['score'].eq(1), 1, 0)
print (df1)
     left_name   right_name  score       col1  col2
0   left_name1  right_name1   0.98      MATCH     1
1   left_name2  right_name2   0.99  POTENTIAL     1
2   left_name3  right_name2   0.97  POTENTIAL     0
3   left_name4  right_name2   0.91  POTENTIAL     0
4   left_name5  right_name3   1.00      MATCH     1
5  right_name6  right_name4   0.92  POTENTIAL     1
6  right_name7  right_name4   0.90  POTENTIAL     0
7  right_name8  right_name5   0.96      MATCH     1
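
The second approach relies on index alignment: the filtered Series is shorter than df1, and Series.eq matches rows by index label, so rows that were filtered out compare as False. A minimal sketch on a small made-up frame (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: row 0 has a perfect score, rows 1 and 2 share a group.
df = pd.DataFrame({
    "right_name": ["a", "b", "b"],
    "score": [1.0, 0.92, 0.90],
})

# Drop the perfect scores, then take the per-group maximum of what is left.
# The result only carries the index labels 1 and 2.
partial = df[df["score"].ne(1)].groupby("right_name")["score"].transform("max")

# Series.eq aligns on index labels; label 0 has no partner in `partial`,
# so the comparison is False there.
mask = partial.eq(df["score"])
print(mask)

# OR-ing with the perfect-score mask restores the flag for row 0.
col2 = (mask | df["score"].eq(1)).astype(int)
print(col2)
```

This is why the | df1['score'].eq(1) term is needed: without it, every row with a score of 1 would end up flagged 0.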

The two approaches differ on this changed sample data, where the right_name4 group now contains a score of 1:

df1 = pd.DataFrame({'left_name' : ['left_name1', 'left_name2', 'left_name3', 'left_name4', 'left_name5', 'right_name6', 'right_name7', 'right_name8'], 
                    'right_name' : ['right_name1', 'right_name2', 'right_name2', 'right_name2', 'right_name3', 'right_name4', 'right_name4', 'right_name5'], 
                    'score' : [0.98, 0.99, 0.97, 0.91, 1, 1.00, 0.90, 0.96]})

#print(df1)


m1 = df1.groupby('right_name')['score'].transform('max').eq(df1['score']).astype(int)
m2 = df1[df1['score'].ne(1)].groupby('right_name')['score'].transform('max').eq(df1['score']).astype(int)

df1['col1'] = np.where(df1['right_name'].duplicated(keep=False),'POTENTIAL', 'MATCH')
df1['col21'] = np.where(m1, 1, 0)
df1['col22'] = np.where(m2 | df1['score'].eq(1), 1, 0)
print (df1)
     left_name   right_name  score       col1  col21  col22
0   left_name1  right_name1   0.98      MATCH      1      1
1   left_name2  right_name2   0.99  POTENTIAL      1      1
2   left_name3  right_name2   0.97  POTENTIAL      0      0
3   left_name4  right_name2   0.91  POTENTIAL      0      0
4   left_name5  right_name3   1.00      MATCH      1      1
5  right_name6  right_name4   1.00  POTENTIAL      1      1
6  right_name7  right_name4   0.90  POTENTIAL      0      1
7  right_name8  right_name5   0.96      MATCH      1      1
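
To reuse either rule on other data, both variants can be wrapped in one function. This is only a sketch: the function name add_flags and the keep_perfect parameter are hypothetical, and the sample frame reuses the changed data from above.

```python
import numpy as np
import pandas as pd


def add_flags(df, keep_perfect=False):
    """Return a copy of df with col1 and col2 added (hypothetical helper)."""
    out = df.copy()
    # right_names that occur more than once are POTENTIAL, the rest MATCH.
    out["col1"] = np.where(out["right_name"].duplicated(keep=False),
                           "POTENTIAL", "MATCH")
    if keep_perfect:
        # Flag rows with score 1, plus the best of the remaining rows per group.
        perfect = out["score"].eq(1)
        rest_max = out[~perfect].groupby("right_name")["score"].transform("max")
        # reindex makes the alignment explicit; excluded rows become NaN,
        # which never equals a score.
        best = rest_max.reindex(out.index).eq(out["score"]) | perfect
    else:
        # Flag only the rows that hold the group maximum.
        best = out.groupby("right_name")["score"].transform("max").eq(out["score"])
    out["col2"] = best.astype(int)
    return out


df1 = pd.DataFrame({
    "left_name": ["left_name1", "left_name2", "left_name3", "left_name4",
                  "left_name5", "right_name6", "right_name7", "right_name8"],
    "right_name": ["right_name1", "right_name2", "right_name2", "right_name2",
                   "right_name3", "right_name4", "right_name4", "right_name5"],
    "score": [0.98, 0.99, 0.97, 0.91, 1, 1.00, 0.90, 0.96],
})

strict = add_flags(df1)                      # first approach
lenient = add_flags(df1, keep_perfect=True)  # second approach
print(strict)
print(lenient)
```

Returning a copy keeps the input frame unchanged, which makes it easy to compare the two variants side by side.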
