Reputation: 1918
Given a dataframe, how can I group by Item and take the max value of Updated Date (compared as datetime rather than as a date string) while keeping the original date string format in the result dataframe?
import pandas as pd

df = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
                   ['A', 20, 'Jul 26, 2019 10:56 AM'],
                   ['B', 30, 'May 20, 2019 05:54 PM'],
                   ['B', 40, 'Apr 28, 2019 06:42 PM']],
                  columns=['Item', 'Quantity', 'Updated Date'])
>>> df
  Item  Quantity           Updated Date
0    A        10  Jun 12, 2019 06:16 PM
1    A        20  Jul 26, 2019 10:56 AM
2    B        30  May 20, 2019 05:54 PM
3    B        40  Apr 28, 2019 06:42 PM
Expected Output
  Item  Quantity           Updated Date
0    A        30  Jul 26, 2019 10:56 AM
1    B        70  May 20, 2019 05:54 PM
What I have tried
If I put 'Updated Date': max in the agg(), it simply returns the max value in alphabetical order:
>>> df.groupby(['Item'], as_index=False).agg({'Quantity': sum, 'Updated Date': max})
  Item  Quantity           Updated Date
0    A        30  Jun 12, 2019 06:16 PM  # expected to be Jul 26
1    B        70  May 20, 2019 05:54 PM
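To make the alphabetical-order behaviour concrete, here is a minimal sketch using the two Item A strings from the sample data. The variable names are illustrative only; the point is that max on strings compares character by character, while max on parsed values compares points in time.

import pandas as pd

dates = pd.Series(['Jun 12, 2019 06:16 PM', 'Jul 26, 2019 10:56 AM'])

# String comparison: 'Jun' sorts after 'Jul' because 'n' > 'l',
# so the earlier date wins.
string_max = dates.max()

# Datetime comparison: parse first, then compare actual timestamps.
datetime_max = pd.to_datetime(dates).max()

print(string_max)    # Jun 12, 2019 06:16 PM
print(datetime_max)  # 2019-07-26 10:56:00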
If I apply pd.to_datetime() first, the result is closer, but the original date string format is lost:
df['Updated Date'] = pd.to_datetime(df['Updated Date'])
df.groupby(['Item'], as_index=False).agg({'Quantity': sum, 'Updated Date': max})
  Item  Quantity        Updated Date
0    A        30 2019-07-26 10:56:00
1    B        70 2019-05-20 17:54:00
Is it possible to apply pd.to_datetime() only during the groupby? The challenge is that the datetime format is not guaranteed to be '%b %d, %Y %I:%M %p', and I want to keep the date string as is in the result.
Upvotes: 2
Views: 908
Reputation: 150785
Converting to datetime and using idxmax is the way to go. You don't need to change your dataframe. For example:
# index labels of the row with the latest date in each Item group
s = pd.to_datetime(df['Updated Date']).groupby(df['Item']).idxmax()

(df.groupby('Item')[['Quantity']].sum()          # sum of Quantity per Item
   .merge(df.loc[s, ['Item', 'Updated Date']],   # original rows holding the latest dates
          on='Item')                             # join on Item
)
Output:
  Item  Quantity           Updated Date
0    A        30  Jul 26, 2019 10:56 AM
1    B        70  May 20, 2019 05:54 PM
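For readers who want to see the intermediate step, the sketch below rebuilds the sample dataframe and inspects what idxmax returns: one row label per Item, which is then used to look up the untouched strings. The names parsed and latest_rows are introduced here only for illustration.

import pandas as pd

df = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
                   ['A', 20, 'Jul 26, 2019 10:56 AM'],
                   ['B', 30, 'May 20, 2019 05:54 PM'],
                   ['B', 40, 'Apr 28, 2019 06:42 PM']],
                  columns=['Item', 'Quantity', 'Updated Date'])

# Parsed copy used only for comparison; df itself is not modified.
parsed = pd.to_datetime(df['Updated Date'])

# For each Item, the row label where the parsed date is largest.
s = parsed.groupby(df['Item']).idxmax()
print(s.tolist())  # [1, 2]

# Look those labels up in the original frame to get the original strings.
latest_rows = df.loc[s, ['Item', 'Updated Date']]
print(latest_rows['Updated Date'].tolist())

Because only row labels travel between the parsed series and the original frame, the date strings never pass through a formatter and come back exactly as they were entered.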
Upvotes: 1
Reputation: 5461
You can do it using a lambda, like below:
df = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
['A', 20, 'Jul 26, 2019 10:56 AM'],
['B', 30, 'May 20, 2019 05:54 PM'],
['B', 40, 'Apr 28, 2019 06:42 PM']],
columns=['Item', 'Quantity', 'Updated Date'])
# parse each group's own strings (g), not the whole column, so idxmax stays within the group
df.groupby(['Item'], as_index=False).agg({'Quantity': 'sum', 'Updated Date': lambda g: g.loc[pd.to_datetime(g).idxmax()]})
Or, without a custom function, like this:
df["Updated Date 2"] = pd.to_datetime(df["Updated Date"])
result = df.groupby(['Item'], as_index=False).agg({'Quantity': sum, 'Updated Date 2': "idxmax"})
result["Updated Date"] = df["Updated Date"].loc[result["Updated Date 2"]].values
result.drop(columns="Updated Date 2", inplace=True)
result
Upvotes: 0
Reputation: 59274
Use the datetime just as a reference for your comparisons.
g = (df.assign(date=pd.to_datetime(df['Updated Date']))
.groupby('Item')
.agg({'Quantity': 'sum', 'date': 'idxmax'}))
g['Updated Date'] = df.loc[g.date, 'Updated Date'].tolist()
g = g.drop(columns='date')  # drop the helper column of row labels
      Quantity           Updated Date
Item
A           30  Jul 26, 2019 10:56 AM
B           70  May 20, 2019 05:54 PM
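If this is needed in more than one place, the same idea can be wrapped in a small function. This is only a sketch: sum_with_latest_date and its parameter names are hypothetical, not part of pandas, and it assumes every date string can be parsed by pd.to_datetime.

import pandas as pd


def sum_with_latest_date(df, key='Item', qty='Quantity', date='Updated Date'):
    """Hypothetical helper: sum `qty` per `key` and keep the original
    string of the latest `date` in each group."""
    parsed = pd.to_datetime(df[date])              # used only for comparison
    latest_idx = parsed.groupby(df[key]).idxmax()  # group key -> row label of latest date
    latest_str = df[date].loc[latest_idx]          # original strings at those rows
    latest_str.index = latest_idx.index            # re-key by group instead of row label
    out = df.groupby(key)[qty].sum().reset_index()
    out[date] = out[key].map(latest_str)           # attach the string for each group
    return out


df = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
                   ['A', 20, 'Jul 26, 2019 10:56 AM'],
                   ['B', 30, 'May 20, 2019 05:54 PM'],
                   ['B', 40, 'Apr 28, 2019 06:42 PM']],
                  columns=['Item', 'Quantity', 'Updated Date'])

result = sum_with_latest_date(df)
print(result)

Mapping by group key rather than assigning by position means the result does not depend on the two groupby calls returning groups in the same order, and the input dataframe is left without any helper column.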
Upvotes: 2