Reputation: 275
I have a Pandas dataframe that looks like this:
I want to grab, for each distinct ID, the row with the max date, so that my final result looks something like this:
My date column has dtype 'object'. I have tried grouping and then grabbing the max, as follows:
idx = df.groupby(['ID','Item'])['date'].transform(max) == df_Trans['date']
df_new = df[idx]
However I am unable to get the desired result.
Upvotes: 14
Views: 22384
Reputation: 71
My answer is a generalization of piRSquared's answer:
manykey indicates the keys from which the mapping is desired (many-to)
onekey indicates the keys to which the mapping is desired (-to-one)
sortkey is the sortable key; the sort direction follows asc, which defaults to True (ascending, as in standard Python sorting)
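import pandas as pd
def get_last(df: pd.DataFrame, manykey: list[str], onekey: list[str], sortkey, asc=True):
    # Pass asc by keyword: the second positional argument of sort_values is not `ascending`
    return df.sort_values(sortkey, ascending=asc).drop_duplicates(subset=manykey, keep='last')[manykey + onekey]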
In your case the answer should be:
get_last(df, ["ID"], ["Item"], "date")
Note that I am using onekey explicitly because I want to drop the rest of the keys (if they are in the table) and create a mapping.
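As a rough usage sketch of the helper above: the sample rows below are hypothetical (the question's table is not shown), the column names ID, Item and date are taken from the question's code, and the date column is parsed with pd.to_datetime on the assumption that it holds ISO-style strings.

```python
import pandas as pd

def get_last(df: pd.DataFrame, manykey: list[str], onekey: list[str], sortkey, asc=True):
    # Sort, keep the last row per manykey group, then keep only the mapping columns
    return df.sort_values(sortkey, ascending=asc).drop_duplicates(subset=manykey, keep="last")[manykey + onekey]

# Hypothetical sample data, not the asker's real table
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2],
    "Item": ["a", "b", "c", "d", "e"],
    "date": ["2017-01-05", "2017-03-01", "2016-12-31", "2017-02-10", "2017-01-20"],
})
df["date"] = pd.to_datetime(df["date"])  # object dtype -> datetime64

result = get_last(df, ["ID"], ["Item"], "date")
print(result)
```

The result keeps only the ID and Item columns, one row per ID, with the Item belonging to that ID's latest date.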
Upvotes: 0
Reputation: 21
The last bit of code from piRSquared's answer is wrong.
We are trying to get distinct IDs, so the column used in drop_duplicates should be 'ID'. keep='last' would then retrieve the last (and max) date for each ID.
df.sort_values(['ID', 'date']).drop_duplicates('ID', keep='last')
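To make the difference concrete, here is a small sketch on hypothetical data (the question's table is not shown). It assumes the date strings are ISO-formatted and converts them with pd.to_datetime first, since the question says the column has dtype object.

```python
import pandas as pd

# Hypothetical sample data, not the asker's real table
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2],
    "Item": ["a", "b", "c", "d", "e"],
    "date": ["2017-01-05", "2017-03-01", "2016-12-31", "2017-02-10", "2017-01-20"],
})
df["date"] = pd.to_datetime(df["date"])  # make the ordering chronological

# Sort within each ID by date, then keep the last (latest) row per ID
latest = df.sort_values(["ID", "date"]).drop_duplicates("ID", keep="last")
print(latest)
```

With this data, one row survives per ID: Item b for ID 1 and Item d for ID 2.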
Upvotes: 2
Reputation: 294218
idxmax
Should work so long as the index is unique or the maximal index isn't repeated.
df.loc[df.groupby('ID').date.idxmax()]
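A minimal sketch of the idxmax approach on hypothetical data follows. Because the question states that the date column has dtype object, the sketch assumes the values are parseable strings and converts them with pd.to_datetime before calling idxmax; whether idxmax works directly on object-dtype strings may depend on the pandas version.

```python
import pandas as pd

# Hypothetical sample data, not the asker's real table
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2],
    "Item": ["a", "b", "c", "d", "e"],
    "date": ["2017-01-05", "2017-03-01", "2016-12-31", "2017-02-10", "2017-01-20"],
})
df["date"] = pd.to_datetime(df["date"])  # object dtype -> datetime64

# idxmax returns, per ID, the index label of the row holding the latest date
latest = df.loc[df.groupby("ID")["date"].idxmax()]
print(latest)
```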
Should work as long as maximal values are unique. Otherwise, you'll get all rows equal to the maximum.
df[df.groupby('ID')['date'].transform('max') == df['date']]
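The tie behaviour can be seen in a short sketch. The data is hypothetical and deliberately gives ID 2 two rows sharing the same maximal date; the dates are again assumed to be parseable strings.

```python
import pandas as pd

# Hypothetical sample data in which ID 2 has a tie on its latest date
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Item": ["a", "b", "c", "d"],
    "date": ["2017-01-05", "2017-03-01", "2017-02-10", "2017-02-10"],
})
df["date"] = pd.to_datetime(df["date"])

# Broadcast each ID's max date back to its rows and compare row by row
mask = df.groupby("ID")["date"].transform("max") == df["date"]
latest = df[mask]
print(latest)
```

Both tied rows for ID 2 are kept here, so the result has three rows rather than one per ID.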
This is also a very good solution:
df.sort_values(['ID', 'date']).drop_duplicates('date', keep='last')
Upvotes: 21