Pandas groupBy with conditional grouping

Question

I have two data frames and need to group the first one based on some criteria from the second df.

df1= 
     summary  participant_id response_date
0        2.0              11    2016-04-30
1        3.0              11    2016-05-01
2        3.0              11    2016-05-02
3        3.0              11    2016-05-03
4        3.0              11    2016-05-04
5        3.0              11    2016-05-05
6        3.0              11    2016-05-06
7        4.0              11    2016-05-07
8        4.0              11    2016-05-08
9        3.0              11    2016-05-09
10       3.0              11    2016-05-10
11       3.0              11    2016-05-11
12       3.0              11    2016-05-12
13       3.0              11    2016-05-13
14       3.0              11    2016-05-14
15       3.0              11    2016-05-15
16       3.0              11    2016-05-16
17       4.0              11    2016-05-17
18       3.0              11    2016-05-18
19       3.0              11    2016-05-19
20       3.0              11    2016-05-20
21       4.0              11    2016-05-21
22       4.0              11    2016-05-22
23       4.0              11    2016-05-23
24       3.0              11    2016-05-24
25       3.0              11    2016-05-25
26       3.0              11    2016-05-26
27       3.0              11    2016-05-27
28       3.0              11    2016-05-28
29       3.0              11    2016-05-29
..       ...             ...           ... 

df2 =
    summary  participant_id response_date
0      12.0              11    2016-04-30
1      12.0              11    2016-05-14
2      14.0              11    2016-05-28
.       ...             ...           ...

I need to group (get blocks) of df1 between the dates in the column of df2. Namely:

df1= 
         summary  participant_id response_date
             2.0              11    2016-04-30

             3.0              11    2016-05-01
             3.0              11    2016-05-02
             3.0              11    2016-05-03
             3.0              11    2016-05-04
             3.0              11    2016-05-05
             3.0              11    2016-05-06
             4.0              11    2016-05-07
             4.0              11    2016-05-08
             3.0              11    2016-05-09
             3.0              11    2016-05-10
             3.0              11    2016-05-11
             3.0              11    2016-05-12
             3.0              11    2016-05-13
             3.0              11    2016-05-14

             3.0              11    2016-05-15
             3.0              11    2016-05-16
             4.0              11    2016-05-17
             3.0              11    2016-05-18
             3.0              11    2016-05-19
             3.0              11    2016-05-20
             4.0              11    2016-05-21
             4.0              11    2016-05-22
             4.0              11    2016-05-23
             3.0              11    2016-05-24
             3.0              11    2016-05-25
             3.0              11    2016-05-26
             3.0              11    2016-05-27
             3.0              11    2016-05-28

             3.0              11    2016-05-29
    ..       ...             ...           ...

Is there an elegant solution with groupby?

Ted Petrou · Accepted Answer

There might be a more elegant solution but you can loop through the response_date values in df2 and create a boolean series of values by checking against the all the response_date values in df1 and simply summing them all up.

df1['group'] = 0
for rd in df2.response_date.values:
    df1['group'] += df1.response_date > rd

Output:

   summary  participant_id response_date  group
0      2.0              11    2016-04-30      0
1      3.0              11    2016-05-01      1
2      3.0              11    2016-05-02      1
3      3.0              11    2016-05-03      1
4      3.0              11    2016-05-04      1

Building off of @Scott's answer:

You can use pd.cut but you will need to add a date before the earliest date and after the latest date in response_date from df2

dates = [pd.Timestamp('2000-1-1')] + 
         df2.response_date.sort_values().tolist() + 
        [pd.Timestamp('2020-1-1')]
df1['group'] = pd.cut(df1['response_date'], dates)

Pandas groupBy with conditional grouping

Answers (2)

Related Questions