DublinMeUp
DublinMeUp

Reputation: 51

Add a column to df with count of rows in separate df which meet multiple criteria

Have two df's, each with the same columns but differing row counts.

Want to add a column to df1 which counts the rows in df2 which match multiple (not all) column criteria of df1.

Preferably the fastest/most efficient way of doing this as I'll have N pairs of df's with N rows in each.

Psuedo:

count = where (df1.one == df2.one) AND 
              (df1.two between (df2.two * 0.9 AND df2.two * 1.1)) AND
              (df1.three == df2.three) AND
              (df1.four == df2.four)

I've tried lots of iterations of:

df1['count'] = np.count_nonzero((df1.one==df2.one)&(con2)..) 
               and df1['count'] = sum((con1)&(con2)..)
               and df1['count'] = len(df1.loc((con1)&(con2)..))

using .isin() and .values etc but get Value errors basically telling me the df's aren't the same size. Have also tried resetting index's etc etc as per other answers I found on here.

I've seen some similar questions answered using merge and groupby but not sure if possible with the amount of criteria I have + how some of my criteria are "betweens" or range lookups.

Thanks in advance!

Example df's:

first = [(1001,'', 10, 'KK', 5),
         (1002,'A', 9, 'QK' , 7),
         (1003,'B', 11, 'QQ', 11) 
        ]

second = [(1004,'', 10.5, 'KK', 5),
          (1005,'', 9.9, 'KK', 5),
          (1006,'', 10, 'KK', 5),
          (1007,'', 10, 'KQ', 5),
          (1008,'A', 7, 'QK' , 9),
          (1009,'A', 9.1, 'QK' , 7),
          (1010,'A', 9, 'QK' , 7),
          (1011,'A', 9, 'KK' , 7),
          (1012,'B', 12, 'KQ', 9),
          (1013,'B', 11, 'QQ', 11),
          (1014,'B', 11, 'QK', 12),
          (1015,'B', 1, 'QQ', 11)
         ]

df1 = pd.DataFrame(first, columns=['ID', 'one', 'two', 'three','four'])
df2 = pd.DataFrame(second, columns=['ID', 'one', 'two', 'three','four'])

df1:

     ID one  two three  four
0  1001       10    KK     5
1  1002   A    9    QK     7
2  1003   B   11    QQ    11

df2

      ID one   two three  four
0   1004      10.5    KK     5
1   1005       9.9    KK     5
2   1006      10.0    KK     5
3   1007      10.0    KQ     5
4   1008   A   7.0    QK     9
5   1009   A   9.1    QK     7
6   1010   A   9.0    QK     7
7   1011   A   9.0    KK     7
8   1012   B  12.0    KQ     9
9   1013   B  11.0    QQ    11
10  1014   B  11.0    QK    12
11  1015   B   1.0    QQ    11

Desired output(df1):

     ID one  two three  four  count
0  1001       10    KK     5     3
1  1002   A    9    QK     7     2
2  1003   B   11    QQ    11     1

Upvotes: 0

Views: 41

Answers (1)

Sajan
Sajan

Reputation: 1267

You could try this:

def condition(x):
    return df2[(x.one == df2.one) & (x.three == df2.three) & 
               (x.four == df2.four) & (x.two > (df2.two*0.9)) & 
               (x.two < (df2.two*1.1))].shape[0]

df1['count'] = df1.apply(lambda x: condition(x), axis=1)

df1
    ID   one  two three  four  count
0  1001       10    KK     5      3
1  1002   A    9    QK     7      2
2  1003   B   11    QQ    11      1

Upvotes: 1

Related Questions