Grant Culp
Grant Culp

Reputation: 283

Python Pandas - Selecting specific rows based on the max and min of two columns with the same group id

I am looking for a way to identify the row that is the 'master' row. The way I am defining the master row is for each group id the row that has the minimum in cust_hierarchy then if it is a tie use the row with the most recent date.

I have supplied some sample tables below:

row_id group_id cust_hierarchy most_recent_date master(I am looking for)
1 0 2 2020-01-03 1
2 0 7 2019-01-01 0
3 1 7 2019-05-01 0
4 1 6 2019-04-01 0
5 1 6 2019-04-03 1

I was thinking of possibly ordering by the two columns (cust_hierarchy (ascending), most_recent_date (descending), and then a new column that places a 1 on the first row for each group id?

Does anyone have any helpful code for this?

Upvotes: 0

Views: 38

Answers (2)

Tom
Tom

Reputation: 8790

You basically can to an groupby with an idxmin(), but with a little bit of sorting to ensure the most recent use date is selected by the min operation:

import pandas as pd
import numpy as np

# example data
dates = ['2020-01-03','2019-01-01','2019-05-01',
         '2019-04-01','2019-04-03']
dates = pd.to_datetime(dates)

df = pd.DataFrame({'group_id':[0,0,1,1,1],
                   'cust_hierarchy':[2,7,7,6,6,],
                   'most_recent_date':dates})

# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()

df before:

   group_id  cust_hierarchy most_recent_date
0         0               2       2020-01-03
1         0               7       2019-01-01
2         1               7       2019-05-01
3         1               6       2019-04-01
4         1               6       2019-04-03

df after:

   group_id  cust_hierarchy most_recent_date  master
0         0               2       2020-01-03    True
1         0               7       2019-01-01   False
2         1               7       2019-05-01   False
3         1               6       2019-04-01   False
4         1               6       2019-04-03    True

Upvotes: 3

Quang Hoang
Quang Hoang

Reputation: 150745

Use duplicated on sort_values:

df['master'] = 1- (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                  ascending=[False, True])
                     .duplicated('group_id', keep='last')
                     .astype(int)
                  )

Upvotes: 1

Related Questions