Reputation: 114
Good day, I have this data set of customers with their region and Date of Birth (dob). Some customers have more than a region assigned, so I need to filter them out and keep only last region on pandas (using max value of string would be good). I tried to filter using group-by but I only manage to display customer and region, and I need to display the whole row (including dob)
Thanks in advance.
Data:
import numpy as np
import pandas as pd
data = [['A', 'FL', '2000-06-01'], ['B', 'FL', '1999-05-01'], ['C', 'FL', '2000-02-03'], ['C', 'NY', '2000-02-03'], ['D', 'FL', '1990-03-01'], ['E', 'NY', '1999-12-07'], ['E', 'TX', '1999-12-07'], ['F', 'FL', '1995-06-07'], ['F', 'TX', '1995-06-07'],]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['customer', 'region', 'dob'])
customer region dob
0 A FL 2000-06-01
1 B FL 1999-05-01
2 C FL 2000-02-03
3 C NY 2000-02-03
4 D FL 1990-03-01
5 E NY 1999-12-07
6 E TX 1999-12-07
7 F FL 1995-06-07
8 F TX 1995-06-07
My take:
df.groupby(['customer'], sort=False)['region'].max()
Partial Output
customer
A FL
B FL
C NY
D FL
E TX
F TX
Desired output:
customer region dob
0 A FL 2000-06-01
1 B FL 1999-05-01
2 C NY 2000-02-03
3 D FL 1990-03-01
4 E TX 1999-12-07
5 F TX 1995-06-07
Upvotes: 1
Views: 1293
Reputation: 378
You can do a drop duplicates instead of grouping. Use this code
df.drop_duplicates(subset=['customer'], keep='last').reset_index(drop=True)
The reset index is just to re-number the index after duplicates are removed. The output will be as follows:
In [12]: df.drop_duplicates(subset=['customer'], keep='last').reset_index(drop=True)
Out[12]:
customer region dob
0 A FL 2000-06-01
1 B FL 1999-05-01
2 C NY 2000-02-03
3 D FL 1990-03-01
4 E TX 1999-12-07
5 F TX 1995-06-07
Upvotes: 1
Reputation: 28699
I think what you need is the tail
-> it returns the last row(s) for each grouping, in this case you need just the last (1) row per grouping:
df.groupby('customer').tail(1)
Out[279]:
customer region dob
0 A FL 2000-06-01
1 B FL 1999-05-01
3 C NY 2000-02-03
4 D FL 1990-03-01
6 E TX 1999-12-07
8 F TX 1995-06-07
Upvotes: 1
Reputation: 3591
If all the values other than region are the same for each customer, you can use df.groupby('customer').max()
; it's the ['region']
part that's restricting the columns to just region. (Also, you can use just use customer
, rather than a list containing customer
). Also, max
returns the alphabetically last element. If you want the value from the last row, you'll need something different.
Upvotes: 1