Reputation: 400
I have this table, in which I would like to merge consecutive rows whenever the previous row's end date equals the next row's start date and both rows have the same trained value, i.e.
end[-1] = start[0] AND trained[-1] = trained[0]
Optional: I would also like to retain the values of re, id, and name from the row where diff is largest.
use  trained  start       end         diff   re  id  name
a    FALSE    01/12/2010  03/01/2018  2,590   0   4    25
a    TRUE     03/01/2018  08/02/2019    401   0   4    25
a    TRUE     08/02/2019  09/02/2019      1   0   4    25
a    TRUE     09/02/2019  31/12/2019    325   1   4    25
b    FALSE    01/08/2016  15/05/2018    652   0   5     8
c    FALSE    01/07/2019  06/08/2019     36   0   4     4
c    TRUE     06/08/2019  18/05/2020    286   0   4     4
c    TRUE     18/05/2020  19/05/2020      1   0   4     4
c    TRUE     19/05/2020  01/09/2020    105   0   4     4
c    TRUE     01/09/2020  31/12/2019  (245)   1   4    15
goal:
use  trained  start       end         diff   re  id  name
a    FALSE    01/12/2010  03/01/2018  2,590   0   4    25
a    TRUE     03/01/2018  31/12/2019    727   0   4    25
b    FALSE    01/08/2016  15/05/2018    652   0   5     8
c    FALSE    01/07/2019  06/08/2019     36   0   4     4
c    TRUE     06/08/2019  31/12/2019    147   0   4     4
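For reproducibility, the sample table can be rebuilt as a pandas DataFrame roughly like this (a minimal sketch; reading trained as booleans, diff as plain integers with (245) taken as -245, and the dates as dd/mm/yyyy strings are all assumptions about how the data is actually stored):
import pandas as pd

# Minimal sketch of the sample data above (dtypes are assumptions, not the real file).
df = pd.DataFrame({
    'use':     ['a', 'a', 'a', 'a', 'b', 'c', 'c', 'c', 'c', 'c'],
    'trained': [False, True, True, True, False, False, True, True, True, True],
    'start':   ['01/12/2010', '03/01/2018', '08/02/2019', '09/02/2019', '01/08/2016',
                '01/07/2019', '06/08/2019', '18/05/2020', '19/05/2020', '01/09/2020'],
    'end':     ['03/01/2018', '08/02/2019', '09/02/2019', '31/12/2019', '15/05/2018',
                '06/08/2019', '18/05/2020', '19/05/2020', '01/09/2020', '31/12/2019'],
    'diff':    [2590, 401, 1, 325, 652, 36, 286, 1, 105, -245],
    're':      [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    'id':      [4, 4, 4, 4, 5, 4, 4, 4, 4, 4],
    'name':    [25, 25, 25, 25, 8, 4, 4, 4, 4, 15],
})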
Upvotes: 0
Views: 149
Reputation: 942
It sounds like you just want the first value for 'start' and the last value for 'end' in your groupby. Assuming your dataframe is called df:
grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first',
    'end': 'last'})
You can use groupby again to get the index of the max value of 'diff' for each combination of 'use' and 'trained'.
max_idx_values = df.groupby(['use', 'trained'])['diff'].idxmax().values
Now you can get the values for fields 're', 'id', 'name':
re_id_name_df = df.loc[df.index.isin(max_idx_values),
                       ['use', 'trained', 're', 'id', 'name']]
Finally, you can merge the two results together to have everything in one dataframe:
final = grouped.merge(re_id_name_df, on=['use', 'trained'])
Here is all the code in one block:
grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first',
    'end': 'last'})
max_idx_values = df.groupby(['use', 'trained'])['diff'].idxmax().values
re_id_name_df = df.loc[df.index.isin(max_idx_values),
                       ['use', 'trained', 're', 'id', 'name']]
final = grouped.merge(re_id_name_df, on=['use', 'trained'])
print(final)
  use  trained       start         end  re  id  name
0   a    False  2010-01-12  2018-03-01   0   4    25
1   a     True  2018-03-01  2019-12-31   0   4    25
2   b    False  2016-01-08  2018-05-15   0   5     8
3   c    False  2019-01-07  2019-06-08   0   4     4
4   c     True  2019-06-08  2019-12-31   0   4     4
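If you also want the summed diff shown in the goal (727 and 147), one option is to add it to the same agg call. This is only a sketch and assumes 'diff' has already been converted to a numeric dtype:
grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first',
    'end': 'last',
    'diff': 'sum'})   # 'sum' only gives 727/147 if 'diff' is numeric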
Upvotes: 1
Reputation: 150825
With your logic, we can use cumsum() on the negated condition to identify the blocks. Then we can use groupby:
blocks = (df['trained'].ne(df['trained'].shift())
          | df['start'].ne(df['end'].shift())
          ).cumsum()

df.groupby([blocks, 'use']).agg({   # change the functions to fit your need
    'trained': 'first',
    'start': 'first',
    'end': 'last',
    'diff': 'sum',
    're': 'min',
    'id': 'first',
    'name': 'first'
}).reset_index('use')
  use trained       start         end          diff  re  id  name
1   a   False  01/12/2010  03/01/2018         2,590   0   4    25
2   a    True  03/01/2018  31/12/2019       4011325   0   4    25
3   b   False  01/08/2016  15/05/2018           652   0   5     8
4   c   False  01/07/2019  06/08/2019            36   0   4     4
5   c    True  06/08/2019  31/12/2019  2861105(245)   0   4     4
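The diff values 4011325 and 2861105(245) are string concatenations (401 + 1 + 325, etc.), which suggests 'diff' is stored as text. Converting it to numbers first lets 'sum' reproduce the 727 and 147 from the goal. A sketch, assuming the thousands comma and the accounting-style (245) = -245 convention seen in the sample:
# Sketch: turn text like '2,590' and '(245)' into the numbers 2590 and -245 before summing.
df['diff'] = (df['diff'].astype(str)
              .str.replace(',', '', regex=False)
              .str.replace(r'\((.*)\)', r'-\1', regex=True)
              .astype(int))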
Upvotes: 1