Ach113
Ach113

Reputation: 1825

Getting a slice from pandas dataframe by comparing contents of df1 with df2

I have two dataframes, for each id in df1 I need to pick rows from df2 with same id, filter out rows where df2.ApplicationDate < df1.ApplicationDate and count how many of such rows exist.

Here is how I am doing it currently:

for i, row in df1.iterrows():
    count = len(df2[(df2['PersonalId']==row['PersonalId']) 
                        & (df2['ApplicationDate'] < row['ApplicationDate']])
    counts.append(count)

This approach works but its hellishly slow on large dataframes, is there any way to accelerate it?

Edit: added sample input with expected output

df1:                  
   Id ApplicationDate
0   1         5-12-20
1   2         6-12-20
2   3         7-12-20
3   4         8-12-20
4   5         9-12-20
5   6        10-12-20
df2:
   Id ApplicationDate
0   1         4-11-20
1   1         4-12-20
2   3         5-12-20
3   3         8-12-20
4   5         1-12-20

expected counts for each id: [2, 0, 1, 0, 1, 0]

Upvotes: 1

Views: 37

Answers (2)

sammywemmy
sammywemmy

Reputation: 28709

df1.merge(df2, on="Id", how="left").assign(
    temp=lambda x: x.ApplicationDate_y.notna(),
    tempr=lambda x: x.ApplicationDate_x > x.ApplicationDate_y,
    counter=lambda x: x.temp & x.tempr,
).groupby("Id").counter.sum()

Id
1    2
2    0
3    1
4    0
5    1
6    0
Name: counter, dtype: int64

The code above merges the dataframe and then uses the sum of the conditions based on the groupby to get the count.

Upvotes: 2

Mykola Zotko
Mykola Zotko

Reputation: 17884

You can can left join both tables:

df3 = df1.merge(df2, on='Id', how='left')

Result:

   Id ApplicationDate_x ApplicationDate_y
0   1        2020-05-12        2020-04-11
1   1        2020-05-12        2020-04-12
2   2        2020-06-12               NaT
3   3        2020-07-12        2020-05-12
4   3        2020-07-12        2020-08-12
5   4        2020-08-12               NaT
6   5        2020-09-12        2020-01-12
7   6        2020-10-12               NaT

Then you can compare dates, group by 'Id' and count True values per group:

df3.ApplicationDate_x.gt(df3.ApplicationDate_y).groupby(df3.Id).sum()

Result:

Id
1    2
2    0
3    1
4    0
5    1
6    0

Upvotes: 3

Related Questions