Reputation: 9345
I have a Pandas DataFrame with columns A, B, C, D, date
. I want to filter out duplicates of A
and B
, keeping the row with the most recent value in date
.
So if I have two rows that look like:
A B C D date
1 1 2 3 1/1/18
1 1 2 3 1/1/17
The correct output would be:
A B C D date
1 1 2 3 1/1/18
I can do this by looping through, but I'd like to use df.groupby(['A', 'B'])
and then aggregate by taking the largest value for date
in each group.
I tried:
df.groupby(['A', 'B']).agg(lambda x: x.iloc[x.date.argmax()])
But I get:
AttributeError: 'Series' object has no attribute 'date'
Any idea what I'm doing incorrectly?
Edit: Hmm if I do:
df.groupby(['A', 'B']).UPDATED_AT.max()
I get mostly what I want but I lose columns D
and C
...
Upvotes: 3
Views: 6329
Reputation: 454
df = pd.DataFrame([[1, 1, 2, 3, '1/1/18'],
[1, 1, 2, 3, '1/1/17']],
columns=['A', 'B', 'C', 'D', 'date'])
Output:
A B C D date
0 1 1 2 3 1/1/18
1 1 1 2 3 1/1/17
Grouping an d removing duplicate:
df.groupby(['A', 'B']).agg(
{
'date': 'max'
})
Output:
date
A B
1 1 1/1/18
This should work. It may work better with having 'date' column to be datetime object.
Upvotes: 0
Reputation: 323306
You can do with
df.date=pd.to_datetime(df.date)
df.sort_values('date').drop_duplicates(['A','B'],keep='last')
A B C D date
0 1 1 2 3 2018-01-01
Upvotes: 7
Reputation: 454
Try df.groupby(['A', 'B']).agg(lambda x: x.iloc[x['date'].argmax()])
pandas has its own date object. Maybe pandas got confused with the series name.
Upvotes: 0