sdowling001
sdowling001

Reputation: 23

Filter pandas group with if else condition

I have a pandas dataframe like this:

ID Tier1 Tier2
1111 RF B
1111 OK B
2222 RF B
2222 RF E
3333 OK B
3333 LO B

I need to cut down the table so the IDs are unique, but do so with the following hierarchy: RF>OK>LO for Tier1. Then B>E for Tier2.

So the expected output will be:

ID Tier1 Tier2
1111 RF B
2222 RF B
2222 RF E
3333 OK B

then:

ID Tier1 Tier2
1111 RF B
2222 RF B
3333 OK B

I am struggling to figure out how to this. My initial attempt is to group the table with grouped = df.groupby('ID') and then:

grouped = df.groupby('ID')
for key, group in grouped:
    check_rf = group['Tier1']=='RF'
    check_ok = group['Tier1']=='OK'
    if check_rf.any():
        group = group[group['Tier1']=='RF']
    elif check_ok.any():
        #and so on

I think this is working to filter each group, but I have no idea how the groups can then relate back to the parent table (df). And I am sure there is a better way to do this.

Thanks!

Upvotes: 2

Views: 336

Answers (2)

Umar.H
Umar.H

Reputation: 23099

Let's use pd.Categorical & drop_duplicates

df['Tier1'] = pd.Categorical(df['Tier1'],['RF','OK','LO'],ordered=True)
df['Tier2'] = pd.Categorical(df['Tier2'],['B','E'],ordered=True)

df1 = df.sort_values(['Tier1','Tier2']).drop_duplicates(subset=['ID'],keep='first')

print(df1)

     ID Tier1 Tier2
0  1111    RF     B
2  2222    RF     B
4  3333    OK     B

Looking at Tier1 you can see the ordering.

print(df['Tier1'])

0    RF
1    OK
2    RF
3    RF
4    OK
5    LO
Name: Tier1, dtype: category
Categories (3, object): ['RF' < 'OK' < 'LO']

Upvotes: 3

J&#233;r&#244;me Richard
J&#233;r&#244;me Richard

Reputation: 50498

You can use two groupby+agg Pandas calls. Since the ordering RF>OK>LO and B>E are respectively compliant the (reverse) lexicographic ordering, you can use the trivial min/max functions for the aggregation (otherwise you can write your own custom min-max functions).
Here is how to do that (using a 2-pass filtering):

tmp = df.groupby(['ID', 'Tier2']).agg(max).reset_index()      # Step 1
output = tmp.groupby(['ID', 'Tier1']).agg(min).reset_index()  # Step 2

Here is the result in output:

     ID Tier1 Tier2
0  1111    RF     B
1  2222    RF     B
2  3333    OK     B

Upvotes: 2

Related Questions