Reputation: 39
I want to group and sum the features in one data frame according to date windows defined by the index dates in a second data frame.
What I mean by that is, let us suppose I have the following data frame:
MemberID | indexDate | yValue |
---|---|---|
a | 2020-11-02 | 2 |
a | 2019-02-03 | 3 |
b | 2018-02-03 | 2 |
a | 2020-02-03 | 9 |
Initially, I created another column with the look-up date (the start of the time period over which I want to sum the features) with the following code:
df['lookupDate'] = df['indexDate'] - pd.DateOffset(years=1)
And the following data frame was obtained:
MemberID | indexDate | lookupDate | yValue |
---|---|---|---|
a | 2020-11-02 | 2019-11-02 | 2 |
a | 2019-02-03 | 2018-02-03 | 3 |
b | 2018-02-03 | 2017-02-03 | 2 |
a | 2020-02-03 | 2019-02-03 | 9 |
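For anyone reproducing the table above, here is a minimal sketch of that step. It assumes `indexDate` starts out as text and therefore needs `pd.to_datetime` before the offset can be subtracted; the sample values are copied from the tables in this post.

```python
import pandas as pd

df = pd.DataFrame({
    "MemberID": ["a", "a", "b", "a"],
    "indexDate": ["2020-11-02", "2019-02-03", "2018-02-03", "2020-02-03"],
    "yValue": [2, 3, 2, 9],
})

# Assumption: indexDate arrives as text, so parse it before doing date arithmetic.
df["indexDate"] = pd.to_datetime(df["indexDate"])

# The look-up date is one calendar year before the index date.
df["lookupDate"] = df["indexDate"] - pd.DateOffset(years=1)

# Reorder the columns to match the table shown above.
df = df[["MemberID", "indexDate", "lookupDate", "yValue"]]
print(df)
```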
I have another data frame that has feature values that I want to merge with the above data frame. The feature data frame looks something like this:
MemberID | serviceDate | feature1 | feature2 | feature3 |
---|---|---|---|---|
a | 2020-09-02 | 1 | 1 | 0 |
a | 2019-03-03 | 0 | 1 | 1 |
b | 2018-05-03 | 0 | 0 | 1 |
a | 2020-06-03 | 1 | 0 | 0 |
Now I want to merge so that, for each row of the first data frame, lookupDate is treated as the cycle-start date and indexDate as the cycle-end date when looking into the second data frame.
For example, for the first row of the first data frame (MemberID 'a'), the cycle-start date is '2019-11-02' and the cycle-end date is '2020-11-02'. Using this window, I want to group the second data frame by MemberID and sum the features for MemberID 'a' over the rows whose serviceDate falls between the cycle-start and cycle-end dates.
Here, the first and last rows of the second data frame would be selected for MemberID 'a', because their serviceDate lies between '2019-11-02' and '2020-11-02'. So, for the first row of the first data frame I want a result like:
MemberID | indexDate | lookupDate | feature1 | feature2 | feature3 | yValue |
---|---|---|---|---|---|---|
a | 2020-11-02 | 2019-11-02 | 2 | 1 | 0 | 2 |
I want to do this for every row, each with its own cycle-start and cycle-end dates.
Upvotes: 1
Views: 230
Reputation: 862451
Use DataFrame.merge, filter by boolean indexing with Series.between, and then aggregate with sum:
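# df is the frame with lookupDate from the question; df1 is assumed to be the feature frame
df2 = df1.merge(df, on='MemberID', how='outer')
# keep service rows inside [lookupDate, indexDate], then sum per index row;
# numeric_only=True skips the serviceDate column, which cannot be summed
df2 = (df2[df2['serviceDate'].between(df2['lookupDate'], df2['indexDate'])]
       .groupby(['MemberID','indexDate','lookupDate','yValue'], as_index=False)
       .sum(numeric_only=True))
# move yValue to the last column
df2['yValue'] = df2.pop('yValue')
print (df2)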
MemberID indexDate lookupDate feature1 feature2 feature3 yValue
0 a 2020-02-03 2019-02-03 0 1 1 9
1 a 2020-11-02 2019-11-02 2 1 0 2
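Note that rows with no service dates inside their window (here the `2019-02-03` row for `a` and the row for `b`) drop out of the result. If those rows should be kept with zero sums, one possible variant is to merge the aggregated sums back onto the first frame with a left join. The following is a self-contained sketch under that assumption; the names `targets`, `features`, `keys` and `feature_cols` are illustrative and not from the original posts, and it assumes each (MemberID, indexDate, lookupDate) combination appears only once in the first frame.

```python
import pandas as pd

# Sample data copied from the question.
targets = pd.DataFrame({
    "MemberID": ["a", "a", "b", "a"],
    "indexDate": pd.to_datetime(["2020-11-02", "2019-02-03", "2018-02-03", "2020-02-03"]),
    "yValue": [2, 3, 2, 9],
})
targets["lookupDate"] = targets["indexDate"] - pd.DateOffset(years=1)

features = pd.DataFrame({
    "MemberID": ["a", "a", "b", "a"],
    "serviceDate": pd.to_datetime(["2020-09-02", "2019-03-03", "2018-05-03", "2020-06-03"]),
    "feature1": [1, 0, 0, 1],
    "feature2": [1, 1, 0, 0],
    "feature3": [0, 1, 1, 0],
})

feature_cols = ["feature1", "feature2", "feature3"]
keys = ["MemberID", "indexDate", "lookupDate"]

# Pair every target row with every service row of the same member,
# then keep only services inside [lookupDate, indexDate] (both ends inclusive).
pairs = targets.merge(features, on="MemberID", how="inner")
in_window = pairs["serviceDate"].between(pairs["lookupDate"], pairs["indexDate"])
sums = pairs[in_window].groupby(keys, as_index=False)[feature_cols].sum()

# Left-merge back so target rows without any matching service get zeros.
result = targets.merge(sums, on=keys, how="left")
result[feature_cols] = result[feature_cols].fillna(0).astype(int)
result = result[keys + feature_cols + ["yValue"]]
print(result)
```

The merge on `MemberID` builds every target/service pair per member before filtering, so memory use grows with the product of the two row counts per member; for large frames it may be worth filtering or chunking by member first.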
Upvotes: 1