Pandas groupby and sum according to an index date from two data frames

Question

I want to group by and sum the features in one data frame according to the index date in a second data frame.

What I mean by that is, let us suppose I have the following data frame:

MemberID	indexDate	yValue
a	2020-11-02	2
a	2019-02-03	3
b	2018-02-03	2
a	2020-02-03	9

Initially, I created another column with the look-up date (the start of the period over which I want to sum the features) using the following code:

df['lookupDate'] = df['indexDate'] - pd.DateOffset(years=1)
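Note that this subtraction assumes indexDate already has a datetime dtype. If the column was read in as strings (an assumption, the dtype is not shown here), it would need converting first. A minimal sketch using the sample rows above:

```python
import pandas as pd

# Sample rows from the first data frame; indexDate starts out as plain strings here.
df = pd.DataFrame({
    "MemberID": ["a", "a", "b", "a"],
    "indexDate": ["2020-11-02", "2019-02-03", "2018-02-03", "2020-02-03"],
    "yValue": [2, 3, 2, 9],
})

# Convert to datetime64 so that date arithmetic works.
df["indexDate"] = pd.to_datetime(df["indexDate"])

# The look-up date is one calendar year before the index date.
df["lookupDate"] = df["indexDate"] - pd.DateOffset(years=1)
```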

And the following data frame was obtained:

MemberID	indexDate	lookupDate	yValue
a	2020-11-02	2019-11-02	2
a	2019-02-03	2018-02-03	3
b	2018-02-03	2017-02-03	2
a	2020-02-03	2019-02-03	9

I have another data frame that has feature values that I want to merge with the above data frame. The feature data frame looks something like this:

MemberID	serviceDate	feature1	feature2	feature3
a	2020-09-02	1	1	0
a	2019-03-03	0	1	1
b	2018-05-03	0	0	1
a	2020-06-03	1	0	0

Now, I want to merge the two so that, for each row of the first data frame, the lookupDate is treated as the cycle-start-date and the indexDate as the cycle-end-date when looking into the second data frame.

For example, for the first row of the first data frame (MemberID 'a'), the cycle-start-date would be '2019-11-02' and the cycle-end-date would be '2020-11-02'. Using these, I would look into the second data frame, group by MemberID, and sum the features for MemberID 'a' over only those rows whose serviceDate falls between the cycle-start and cycle-end dates.

Here, from the second data frame, the first and the last row for MemberID 'a' would be selected because the serviceDate lies between '2019-11-02' and '2020-11-02'. So, for the first row of the first data frame I want a result like:

MemberID	indexDate	lookupDate	feature1	feature2	feature3	yValue
a	2020-11-02	2019-11-02	2	1	0	2

Now, I want to do this for each row with a different cycle-start and cycle-end dates.

jezrael · Accepted Answer

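Use DataFrame.merge to pair every feature row with every index row of the same MemberID, filter with boolean indexing using Series.between (inclusive on both ends by default), and then aggregate with sum. Here df is the data frame with indexDate and lookupDate from the question, and df1 is the feature data frame: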

df2 = df1.merge(df, on='MemberID', how='outer')

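# keep only services inside [lookupDate, indexDate], then sum per index row
# numeric_only=True skips the datetime serviceDate column, which cannot be summed in pandas 2.x
df2 = (df2[df2['serviceDate'].between(df2['lookupDate'], df2['indexDate'])]
          .groupby(['MemberID','indexDate','lookupDate','yValue'], as_index=False)
          .sum(numeric_only=True))
# move yValue to the last column
df2['yValue'] = df2.pop('yValue')
print(df2)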
  MemberID  indexDate lookupDate  feature1  feature2  feature3  yValue
0        a 2020-02-03 2019-02-03         0         1         1       9
1        a 2020-11-02 2019-11-02         2         1         0       2
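The output above drops index rows that have no service inside their window. If those rows should be kept with zero sums, one possible variant (a sketch, not part of the original answer) is to aggregate as before and then left-join the sums back onto df. The names feature_cols, keys, merged, summed and result are illustrative, and the sample frames are rebuilt here so the snippet runs on its own:

```python
import pandas as pd

# Index frame (df in the answer): one row per member and index date.
df = pd.DataFrame({
    "MemberID": ["a", "a", "b", "a"],
    "indexDate": pd.to_datetime(["2020-11-02", "2019-02-03", "2018-02-03", "2020-02-03"]),
    "yValue": [2, 3, 2, 9],
})
df["lookupDate"] = df["indexDate"] - pd.DateOffset(years=1)

# Feature frame (df1 in the answer): one row per service.
df1 = pd.DataFrame({
    "MemberID": ["a", "a", "b", "a"],
    "serviceDate": pd.to_datetime(["2020-09-02", "2019-03-03", "2018-05-03", "2020-06-03"]),
    "feature1": [1, 0, 0, 1],
    "feature2": [1, 1, 0, 0],
    "feature3": [0, 1, 1, 0],
})

feature_cols = ["feature1", "feature2", "feature3"]  # illustrative name
keys = ["MemberID", "indexDate", "lookupDate", "yValue"]

# Pair every index row with every service of the same member.
merged = df.merge(df1, on="MemberID", how="inner")

# Keep services inside [lookupDate, indexDate]; between() is inclusive by default.
in_window = merged["serviceDate"].between(merged["lookupDate"], merged["indexDate"])

# Sum only the feature columns, so the datetime serviceDate column is never aggregated.
summed = merged[in_window].groupby(keys, as_index=False)[feature_cols].sum()

# Left-join back so index rows without any matching service survive, filled with 0.
result = df.merge(summed, on=keys, how="left")
result[feature_cols] = result[feature_cols].fillna(0).astype(int)
result = result[["MemberID", "indexDate", "lookupDate", *feature_cols, "yValue"]]
print(result)
```

With the sample data, result has all four index rows in their original order; the two rows with no service in their window carry zeros in the feature columns. Be aware that the merge builds every member-level pair before filtering, so it can get large when members have many index dates and many services.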
