Agnes Lee
Agnes Lee

Reputation: 321

pandas: Create new column by comparing DataFrame rows with columns of another DataFrame

Assume I have df1:

df1= pd.DataFrame({'alligator_apple': range(1, 11),
                  'barbadine': range(11, 21),
                  'capulin_cherry': range(21, 31)})

   alligator_apple  barbadine  capulin_cherry
0                1         11              21
1                2         12              22
2                3         13              23
3                4         14              24
4                5         15              25
5                6         16              26
6                7         17              27
7                8         18              28
8                9         19              29
9               10         20              30

And a df2:

df2= pd.DataFrame({'alligator_apple': [6, 7, 15, 5],
                  'barbadine': [3, 19, 25, 12],
                  'capulin_cherry': [1, 9, 15, 27]})

   alligator_apple  barbadine  capulin_cherry
0                6          3               1
1                7         19               9
2               15         25              15
3                5         12              27

I'm looking for a way to create a new column in df2 that gets number of rows based on a condition where all columns in df1 has values greater than their counterparts in df2 for each row. For example:

   alligator_apple  barbadine  capulin_cherry  greater
0                6          3               1       4
1                7         19               9       1
2               15         25              15       0
3                5         12              27       3

To elaborate, at row 0 of df2, df1.alligator_apple has 4 rows which values are higher than df2.alligator_apple with the value of 6. df1.barbadine has 10 rows which values are higher than df2.barbadine with value of 3, while similarly df1.capulin_cherry has 10 rows.

Finally, apply an 'and' condition to all aforementioned conditions to get the number '4' of df2.greater of first row. Repeat for the rest of rows in df2.

Is there a simple way to do this?

Upvotes: 2

Views: 1121

Answers (2)

Clay Shwery
Clay Shwery

Reputation: 380

There's a general solution to this that should work well.

def gt_mask(row,df):
    mask = True
    for key,val in row.items():
        mask &= df[key] > val
    return len(df[mask])

df2['greater'] = df2.apply(gt_mask,df=df1,axis=1)

Output df2

,alligator_apple,barbadine,capulin_cherry,greater
0,6,3,1,4
1,7,19,9,1
2,15,25,15,0
3,5,12,27,3

This creates a mask, iterating through the key/val pairs for a given row.

Edit this answer was a big help: Masking a DataFrame on multiple column conditions - inside a loop

Upvotes: 2

jfaccioni
jfaccioni

Reputation: 7509

I believe this does what you want:

df2['greater'] = df2.apply(
    lambda row: 
    (df1['alligator_apple'] > row['alligator_apple']) & 
    (df1['barbadine'] > row['barbadine']) & 
    (df1['capulin_cherry'] > row['capulin_cherry']), 
    axis=1,
).sum(axis=1)

print(df2)

output:

   alligator_apple  barbadine  capulin_cherry  greater
0                6          3               1        4
1                7         19               9        1
2               15         25              15        0
3                5         12              27        3

Edit: if you want to generalize and apply this logic for a given column set, we can use functools.reduce together with operator.and_:

import functools
import operator

columns = ['alligator_apple', 'barbadine', 'capulin_cherry']

df2['greater'] = df2.apply(
    lambda row: functools.reduce(
        operator.and_, 
        (df1[column] > row[column] for column in columns),
    ), 
    axis=1,
).sum(axis=1)

Upvotes: 2

Related Questions