Reputation: 283
I am looking for a way to identify the row that is the 'master' row. The way I am defining the master row is for each group id the row that has the minimum in cust_hierarchy then if it is a tie use the row with the most recent date.
I have supplied some sample tables below:
row_id | group_id | cust_hierarchy | most_recent_date | master(I am looking for) |
---|---|---|---|---|
1 | 0 | 2 | 2020-01-03 | 1 |
2 | 0 | 7 | 2019-01-01 | 0 |
3 | 1 | 7 | 2019-05-01 | 0 |
4 | 1 | 6 | 2019-04-01 | 0 |
5 | 1 | 6 | 2019-04-03 | 1 |
I was thinking of possibly ordering by the two columns (cust_hierarchy (ascending), most_recent_date (descending), and then a new column that places a 1 on the first row for each group id?
Does anyone have any helpful code for this?
Upvotes: 0
Views: 38
Reputation: 8790
You basically can to an groupby
with an idxmin()
, but with a little bit of sorting to ensure the most recent use date is selected by the min
operation:
import pandas as pd
import numpy as np
# example data
dates = ['2020-01-03','2019-01-01','2019-05-01',
'2019-04-01','2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id':[0,0,1,1,1],
'cust_hierarchy':[2,7,7,6,6,],
'most_recent_date':dates})
# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df
before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df
after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Upvotes: 3
Reputation: 150745
Use duplicated
on sort_values
:
df['master'] = 1- (df.sort_values(['cust_hierarchy', 'most_recent_date'],
ascending=[False, True])
.duplicated('group_id', keep='last')
.astype(int)
)
Upvotes: 1