Reputation: 51
Have two df's, each with the same columns but differing row counts.
Want to add a column to df1 which counts the rows in df2 which match multiple (not all) column criteria of df1.
Preferably the fastest/most efficient way of doing this as I'll have N pairs of df's with N rows in each.
Psuedo:
count = where (df1.one == df2.one) AND
(df1.two between (df2.two * 0.9 AND df2.two * 1.1)) AND
(df1.three == df2.three) AND
(df1.four == df2.four)
I've tried lots of iterations of:
df1['count'] = np.count_nonzero((df1.one==df2.one)&(con2)..)
and df1['count'] = sum((con1)&(con2)..)
and df1['count'] = len(df1.loc((con1)&(con2)..))
using .isin()
and .values
etc but get Value errors basically telling me the df's aren't the same size. Have also tried resetting index's etc etc as per other answers I found on here.
I've seen some similar questions answered using merge and groupby but not sure if possible with the amount of criteria I have + how some of my criteria are "betweens" or range lookups.
Thanks in advance!
Example df's:
first = [(1001,'', 10, 'KK', 5),
(1002,'A', 9, 'QK' , 7),
(1003,'B', 11, 'QQ', 11)
]
second = [(1004,'', 10.5, 'KK', 5),
(1005,'', 9.9, 'KK', 5),
(1006,'', 10, 'KK', 5),
(1007,'', 10, 'KQ', 5),
(1008,'A', 7, 'QK' , 9),
(1009,'A', 9.1, 'QK' , 7),
(1010,'A', 9, 'QK' , 7),
(1011,'A', 9, 'KK' , 7),
(1012,'B', 12, 'KQ', 9),
(1013,'B', 11, 'QQ', 11),
(1014,'B', 11, 'QK', 12),
(1015,'B', 1, 'QQ', 11)
]
df1 = pd.DataFrame(first, columns=['ID', 'one', 'two', 'three','four'])
df2 = pd.DataFrame(second, columns=['ID', 'one', 'two', 'three','four'])
df1:
ID one two three four
0 1001 10 KK 5
1 1002 A 9 QK 7
2 1003 B 11 QQ 11
df2
ID one two three four
0 1004 10.5 KK 5
1 1005 9.9 KK 5
2 1006 10.0 KK 5
3 1007 10.0 KQ 5
4 1008 A 7.0 QK 9
5 1009 A 9.1 QK 7
6 1010 A 9.0 QK 7
7 1011 A 9.0 KK 7
8 1012 B 12.0 KQ 9
9 1013 B 11.0 QQ 11
10 1014 B 11.0 QK 12
11 1015 B 1.0 QQ 11
Desired output(df1):
ID one two three four count
0 1001 10 KK 5 3
1 1002 A 9 QK 7 2
2 1003 B 11 QQ 11 1
Upvotes: 0
Views: 41
Reputation: 1267
You could try this:
def condition(x):
return df2[(x.one == df2.one) & (x.three == df2.three) &
(x.four == df2.four) & (x.two > (df2.two*0.9)) &
(x.two < (df2.two*1.1))].shape[0]
df1['count'] = df1.apply(lambda x: condition(x), axis=1)
df1
ID one two three four count
0 1001 10 KK 5 3
1 1002 A 9 QK 7 2
2 1003 B 11 QQ 11 1
Upvotes: 1