noiivice

Reputation: 400

pandas groupby consecutive dates and multiple conditions

I have this table in which I would like to merge consecutive rows together wherever
end[-1] = start[0] AND trained[-1] = trained[0]
i.e. the previous row's end date equals the next row's start date and the trained value is unchanged.

Optional: I would also like to keep the values of re, id and name from the row where diff is largest within each merged group. (A sketch to rebuild this sample data as a DataFrame is included after the goal table below.)

use trained start       end          diff   re  id  name
a   FALSE   01/12/2010  03/01/2018   2,590  0   4   25
a   TRUE    03/01/2018  08/02/2019   401    0   4   25
a   TRUE    08/02/2019  09/02/2019   1      0   4   25
a   TRUE    09/02/2019  31/12/2019   325    1   4   25
b   FALSE   01/08/2016  15/05/2018   652    0   5   8
c   FALSE   01/07/2019  06/08/2019   36     0   4   4
c   TRUE    06/08/2019  18/05/2020   286    0   4   4
c   TRUE    18/05/2020  19/05/2020   1      0   4   4
c   TRUE    19/05/2020  01/09/2020   105    0   4   4
c   TRUE    01/09/2020  31/12/2019   (245)  1   4   15

goal:

use trained start       end          diff   re  id  name
a   FALSE   01/12/2010  03/01/2018   2,590  0   4   25
a   TRUE    03/01/2018  31/12/2019   727    0   4   25
b   FALSE   01/08/2016  15/05/2018   652    0   5   8
c   FALSE   01/07/2019  06/08/2019   36     0   4   4
c   TRUE    06/08/2019  31/12/2019   147    0   4   4
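For reference, the sample data above can be rebuilt as a DataFrame along these lines (a sketch: diff is kept as the formatted strings shown in the table, and the dates are dd/mm/yyyy strings):

import pandas as pd

# sample data from the question; 'diff' values are the formatted strings as shown,
# dates are dd/mm/yyyy strings
df = pd.DataFrame({
    'use':     ['a', 'a', 'a', 'a', 'b', 'c', 'c', 'c', 'c', 'c'],
    'trained': [False, True, True, True, False, False, True, True, True, True],
    'start':   ['01/12/2010', '03/01/2018', '08/02/2019', '09/02/2019', '01/08/2016',
                '01/07/2019', '06/08/2019', '18/05/2020', '19/05/2020', '01/09/2020'],
    'end':     ['03/01/2018', '08/02/2019', '09/02/2019', '31/12/2019', '15/05/2018',
                '06/08/2019', '18/05/2020', '19/05/2020', '01/09/2020', '31/12/2019'],
    'diff':    ['2,590', '401', '1', '325', '652', '36', '286', '1', '105', '(245)'],
    're':      [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    'id':      [4, 4, 4, 4, 5, 4, 4, 4, 4, 4],
    'name':    [25, 25, 25, 25, 8, 4, 4, 4, 4, 15],
})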

Upvotes: 0

Views: 149

Answers (2)

robertwest

Reputation: 942

It sounds like you just want the first value of 'start' and the last value of 'end' within each group:

Assuming your dataframe is called df:

grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first', 
    'end': 'last'})

You can use groupby again to get the index of the maximum 'diff' value for each ('use', 'trained') pair:

max_idx_values = df.groupby(['use', 'trained'])['diff'].idxmax().values

Now you can get the values for fields 're', 'id', 'name':

re_id_name_df = df.loc[df.index.isin(max_idx_values), 
                       ['use', 'trained', 're',  'id',  'name']]

Finally, you can merge the two results together to have everything in one dataframe:

final = grouped.merge(re_id_name_df, on=['use', 'trained'])

Here is all the code in one block:

grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first', 
    'end': 'last'})
max_idx_values = df.groupby(['use', 'trained'])['diff'].idxmax().values
re_id_name_df = df.loc[df.index.isin(max_idx_values), 
                       ['use', 'trained', 're',  'id',  'name']]
final = grouped.merge(re_id_name_df, on=['use', 'trained'])
print(final)

  use  trained      start        end  re  id  name
0   a    False 2010-01-12 2018-03-01   0   4    25
1   a     True 2018-03-01 2019-12-31   0   4    25
2   b    False 2016-01-08 2018-05-15   0   5     8
3   c    False 2019-01-07 2019-06-08   0   4     4
4   c     True 2019-06-08 2019-12-31   0   4     4
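The dates in this output are datetimes, apparently parsed with pandas' default month-first reading. If your start/end columns are dd/mm/yyyy strings as they appear in the question, a conversion along these lines (dayfirst=True is an assumption about the format) would be needed before taking 'first' and 'last':

# assumed: 'start' and 'end' are dd/mm/yyyy strings to be parsed as datetimes
df['start'] = pd.to_datetime(df['start'], dayfirst=True)
df['end'] = pd.to_datetime(df['end'], dayfirst=True)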

Upvotes: 1

Quang Hoang

Reputation: 150825

Following your logic, we can take the cumulative sum of the negated merge condition (trained changes, or start does not continue the previous row's end) to label the consecutive blocks. Then we can use groupby on those labels:

# a row starts a new block whenever 'trained' changes or 'start' does not
# continue the previous row's 'end'
blocks = (df['trained'].ne(df['trained'].shift())
          | df['start'].ne(df['end'].shift())
          ).cumsum()

df.groupby([blocks, 'use']).agg({   # change the aggregation functions to fit your needs
    'trained': 'first',
    'start': 'first',
    'end': 'last',
    'diff': 'sum',
    're': 'min',
    'id': 'first',
    'name': 'first'
}).reset_index('use')

  use  trained       start         end          diff  re  id  name
1   a    False  01/12/2010  03/01/2018         2,590   0   4    25
2   a     True  03/01/2018  31/12/2019       4011325   0   4    25
3   b    False  01/08/2016  15/05/2018           652   0   5     8
4   c    False  01/07/2019  06/08/2019            36   0   4     4
5   c     True  06/08/2019  31/12/2019  2861105(245)   0   4     4
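Note that the diff column above was concatenated rather than summed because the sample values are strings ('2,590', '(245)', ...). If that matches your data, a rough cleanup like the sketch below (assuming commas are thousands separators and parentheses mark negative values) makes the column numeric so that 'diff': 'sum' gives the totals in your goal table:

# assumption: commas are thousands separators, parentheses mark negatives
df['diff'] = (df['diff'].astype(str)
              .str.replace(',', '', regex=False)
              .str.replace('(', '-', regex=False)
              .str.replace(')', '', regex=False)
              .astype(int))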

Upvotes: 1
