Mauro Del Nook
Mauro Del Nook

Reputation: 114

Filter data using max. categorical value of a group in pandas

Good day, I have this data set of customers with their region and Date of Birth (dob). Some customers have more than a region assigned, so I need to filter them out and keep only last region on pandas (using max value of string would be good). I tried to filter using group-by but I only manage to display customer and region, and I need to display the whole row (including dob)

Thanks in advance.

Data:

import numpy as np
import pandas as pd 
data = [['A', 'FL', '2000-06-01'], ['B', 'FL', '1999-05-01'], ['C', 'FL', '2000-02-03'], ['C', 'NY', '2000-02-03'], ['D', 'FL', '1990-03-01'], ['E', 'NY', '1999-12-07'], ['E', 'TX', '1999-12-07'], ['F', 'FL', '1995-06-07'], ['F', 'TX', '1995-06-07'],]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['customer', 'region', 'dob']) 

    customer    region  dob
0   A   FL  2000-06-01
1   B   FL  1999-05-01
2   C   FL  2000-02-03
3   C   NY  2000-02-03
4   D   FL  1990-03-01
5   E   NY  1999-12-07
6   E   TX  1999-12-07
7   F   FL  1995-06-07
8   F   TX  1995-06-07

My take:

df.groupby(['customer'], sort=False)['region'].max()

Partial Output

customer
A    FL
B    FL
C    NY
D    FL
E    TX
F    TX

Desired output:

    customer    region  dob
0   A   FL  2000-06-01
1   B   FL  1999-05-01
2   C   NY  2000-02-03
3   D   FL  1990-03-01
4   E   TX  1999-12-07
5   F   TX  1995-06-07

Upvotes: 1

Views: 1293

Answers (3)

aj7amigo
aj7amigo

Reputation: 378

You can do a drop duplicates instead of grouping. Use this code

df.drop_duplicates(subset=['customer'], keep='last').reset_index(drop=True)

The reset index is just to re-number the index after duplicates are removed. The output will be as follows:

In [12]: df.drop_duplicates(subset=['customer'], keep='last').reset_index(drop=True)
Out[12]: 
  customer region         dob
0        A     FL  2000-06-01
1        B     FL  1999-05-01
2        C     NY  2000-02-03
3        D     FL  1990-03-01
4        E     TX  1999-12-07
5        F     TX  1995-06-07

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28699

I think what you need is the tail -> it returns the last row(s) for each grouping, in this case you need just the last (1) row per grouping:

df.groupby('customer').tail(1)
Out[279]: 
  customer region         dob
0        A     FL  2000-06-01
1        B     FL  1999-05-01
3        C     NY  2000-02-03
4        D     FL  1990-03-01
6        E     TX  1999-12-07
8        F     TX  1995-06-07

Upvotes: 1

Acccumulation
Acccumulation

Reputation: 3591

If all the values other than region are the same for each customer, you can use df.groupby('customer').max(); it's the ['region'] part that's restricting the columns to just region. (Also, you can use just use customer, rather than a list containing customer). Also, max returns the alphabetically last element. If you want the value from the last row, you'll need something different.

Upvotes: 1

Related Questions