Reputation: 3086
I have two data frames and need to group the first one based on some criteria from the second df.
df1=
summary participant_id response_date
0 2.0 11 2016-04-30
1 3.0 11 2016-05-01
2 3.0 11 2016-05-02
3 3.0 11 2016-05-03
4 3.0 11 2016-05-04
5 3.0 11 2016-05-05
6 3.0 11 2016-05-06
7 4.0 11 2016-05-07
8 4.0 11 2016-05-08
9 3.0 11 2016-05-09
10 3.0 11 2016-05-10
11 3.0 11 2016-05-11
12 3.0 11 2016-05-12
13 3.0 11 2016-05-13
14 3.0 11 2016-05-14
15 3.0 11 2016-05-15
16 3.0 11 2016-05-16
17 4.0 11 2016-05-17
18 3.0 11 2016-05-18
19 3.0 11 2016-05-19
20 3.0 11 2016-05-20
21 4.0 11 2016-05-21
22 4.0 11 2016-05-22
23 4.0 11 2016-05-23
24 3.0 11 2016-05-24
25 3.0 11 2016-05-25
26 3.0 11 2016-05-26
27 3.0 11 2016-05-27
28 3.0 11 2016-05-28
29 3.0 11 2016-05-29
.. ... ... ...
df2 =
summary participant_id response_date
0 12.0 11 2016-04-30
1 12.0 11 2016-05-14
2 14.0 11 2016-05-28
. ... ... ...
I need to group (get blocks) of df1
between the dates in the column of df2
. Namely:
df1=
summary participant_id response_date
2.0 11 2016-04-30
3.0 11 2016-05-01
3.0 11 2016-05-02
3.0 11 2016-05-03
3.0 11 2016-05-04
3.0 11 2016-05-05
3.0 11 2016-05-06
4.0 11 2016-05-07
4.0 11 2016-05-08
3.0 11 2016-05-09
3.0 11 2016-05-10
3.0 11 2016-05-11
3.0 11 2016-05-12
3.0 11 2016-05-13
3.0 11 2016-05-14
3.0 11 2016-05-15
3.0 11 2016-05-16
4.0 11 2016-05-17
3.0 11 2016-05-18
3.0 11 2016-05-19
3.0 11 2016-05-20
4.0 11 2016-05-21
4.0 11 2016-05-22
4.0 11 2016-05-23
3.0 11 2016-05-24
3.0 11 2016-05-25
3.0 11 2016-05-26
3.0 11 2016-05-27
3.0 11 2016-05-28
3.0 11 2016-05-29
.. ... ... ...
Is there an elegant solution with groupby
?
Upvotes: 1
Views: 329
Reputation: 62037
There might be a more elegant solution but you can loop through the response_date
values in df2
and create a boolean series of values by checking against the all the response_date
values in df1
and simply summing them all up.
df1['group'] = 0
for rd in df2.response_date.values:
df1['group'] += df1.response_date > rd
Output:
summary participant_id response_date group
0 2.0 11 2016-04-30 0
1 3.0 11 2016-05-01 1
2 3.0 11 2016-05-02 1
3 3.0 11 2016-05-03 1
4 3.0 11 2016-05-04 1
Building off of @Scott's answer:
You can use pd.cut
but you will need to add a date before the earliest date and after the latest date in response_date
from df2
dates = [pd.Timestamp('2000-1-1')] +
df2.response_date.sort_values().tolist() +
[pd.Timestamp('2020-1-1')]
df1['group'] = pd.cut(df1['response_date'], dates)
Upvotes: 2
Reputation: 5549
You want the .cut
method. This lets you bin your dates by some other list of dates.
df1['cuts'] = pd.cut(df1['response_date'], df2['response_date'])
grouped = df1.groupby('cuts')
print grouped.max() #for example
Upvotes: 1