Reputation: 1825
I have two dataframes. For each Id in df1, I need to pick the rows from df2 with the same Id, keep only the rows where df2.ApplicationDate < df1.ApplicationDate, and count how many such rows exist.
Here is how I am doing it currently:
counts = []
for i, row in df1.iterrows():
    count = len(df2[(df2['PersonalId'] == row['PersonalId'])
                    & (df2['ApplicationDate'] < row['ApplicationDate'])])
    counts.append(count)
This approach works, but it's hellishly slow on large dataframes. Is there any way to accelerate it?
Edit: added sample input with expected output
df1:
Id ApplicationDate
0 1 5-12-20
1 2 6-12-20
2 3 7-12-20
3 4 8-12-20
4 5 9-12-20
5 6 10-12-20
df2:
Id ApplicationDate
0 1 4-11-20
1 1 4-12-20
2 3 5-12-20
3 3 8-12-20
4 5 1-12-20
Expected counts, one per Id in df1: [2, 0, 1, 0, 1, 0]
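For reproducibility, here is a self-contained sketch of the sample data with the loop applied to it (the dates are assumed to be day-month-year, and the column is named 'Id' as in the sample, where the loop above uses 'PersonalId' for the same column):

```python
import pandas as pd

# Sample data from the question; dates assumed to be day-month-year.
df1 = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "ApplicationDate": pd.to_datetime(
        ["5-12-20", "6-12-20", "7-12-20", "8-12-20", "9-12-20", "10-12-20"],
        dayfirst=True),
})
df2 = pd.DataFrame({
    "Id": [1, 1, 3, 3, 5],
    "ApplicationDate": pd.to_datetime(
        ["4-11-20", "4-12-20", "5-12-20", "8-12-20", "1-12-20"],
        dayfirst=True),
})

# The row-wise loop from the question, one count per row of df1.
counts = []
for _, row in df1.iterrows():
    count = len(df2[(df2["Id"] == row["Id"])
                    & (df2["ApplicationDate"] < row["ApplicationDate"])])
    counts.append(count)

print(counts)  # [2, 0, 1, 0, 1, 0]
```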
Upvotes: 1
Views: 37
Reputation: 28709
df1.merge(df2, on="Id", how="left").assign(
temp=lambda x: x.ApplicationDate_y.notna(),
tempr=lambda x: x.ApplicationDate_x > x.ApplicationDate_y,
counter=lambda x: x.temp & x.tempr,
).groupby("Id").counter.sum()
Id
1 2
2 0
3 1
4 0
5 1
6 0
Name: counter, dtype: int64
The code above left-joins the two dataframes, flags the rows where ApplicationDate_x is later than ApplicationDate_y, and then sums those flags per Id group to get the counts.
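As a side note, the explicit notna flag can likely be dropped, since comparisons involving NaT already evaluate to False. A condensed sketch of the same merge-and-sum approach, with the sample data re-created here for illustration (dates assumed day-month-year):

```python
import pandas as pd

# Sample data from the question; dates assumed to be day-month-year.
df1 = pd.DataFrame({"Id": [1, 2, 3, 4, 5, 6],
                    "ApplicationDate": pd.to_datetime(
                        ["5-12-20", "6-12-20", "7-12-20",
                         "8-12-20", "9-12-20", "10-12-20"], dayfirst=True)})
df2 = pd.DataFrame({"Id": [1, 1, 3, 3, 5],
                    "ApplicationDate": pd.to_datetime(
                        ["4-11-20", "4-12-20", "5-12-20",
                         "8-12-20", "1-12-20"], dayfirst=True)})

# Left-join, flag the earlier df2 dates, and sum the flags per Id.
counts = (
    df1.merge(df2, on="Id", how="left")
       .assign(counter=lambda x: x.ApplicationDate_x > x.ApplicationDate_y)
       .groupby("Id").counter.sum()
)
print(counts.tolist())  # [2, 0, 1, 0, 1, 0]
```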
Upvotes: 2
Reputation: 17884
You can left-join both tables:
df3 = df1.merge(df2, on='Id', how='left')
Result:
Id ApplicationDate_x ApplicationDate_y
0 1 2020-05-12 2020-04-11
1 1 2020-05-12 2020-04-12
2 2 2020-06-12 NaT
3 3 2020-07-12 2020-05-12
4 3 2020-07-12 2020-08-12
5 4 2020-08-12 NaT
6 5 2020-09-12 2020-01-12
7 6 2020-10-12 NaT
Then you can compare the dates, group by 'Id', and count the True values per group:
df3.ApplicationDate_x.gt(df3.ApplicationDate_y).groupby(df3.Id).sum()
Result:
Id
1 2
2 0
3 1
4 0
5 1
6 0
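This also handles ids with no match in df2: after the left join their ApplicationDate_y is NaT, and comparisons against NaT evaluate to False, so those groups sum to 0. A minimal demonstration of that behavior:

```python
import pandas as pd

# Second value parses to NaT, mimicking an unmatched left-join row.
s = pd.Series(pd.to_datetime(["2020-05-12", None]))
t = pd.Series(pd.to_datetime(["2020-04-11", "2020-04-11"]))

flags = s.gt(t)
print(flags.tolist())  # [True, False] -- comparisons with NaT are always False
```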
Upvotes: 3