henrywongkk

Reputation: 1918

How to convert date string to datetime in agg function during groupby

Given a dataframe, how can I group by Item and take the max of Updated Date (compared as datetime rather than as a date string) while keeping the original date string format in the result dataframe?

import pandas as pd

df = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
                  ['A', 20, 'Jul 26, 2019 10:56 AM'],
                  ['B', 30, 'May 20, 2019 05:54 PM'],
                  ['B', 40, 'Apr 28, 2019 06:42 PM']],
                  columns=['Item', 'Quantity', 'Updated Date'])
>>> df
  Item  Quantity           Updated Date
0    A        10  Jun 12, 2019 06:16 PM
1    A        20  Jul 26, 2019 10:56 AM
2    B        30  May 20, 2019 05:54 PM
3    B        40  Apr 28, 2019 06:42 PM

Expected Output

  Item  Quantity           Updated Date
0    A        30  Jul 26, 2019 10:56 AM
1    B        70  May 20, 2019 05:54 PM

What I have tried

If I put 'Updated Date': max in agg(), it simply returns the max value in alphabetical order, because the column holds strings:

>>> df.groupby(['Item'], as_index=False).agg({'Quantity': sum, 'Updated Date': max})
  Item  Quantity           Updated Date
0    A        30  Jun 12, 2019 06:16 PM   # expected to be Jul 26
1    B        70  May 20, 2019 05:54 PM

If I apply pd.to_datetime() first, the right rows are picked, but the original date string format is lost:

df['Updated Date'] = pd.to_datetime(df['Updated Date'])
df.groupby(['Item'], as_index=False).agg({'Quantity': sum, 'Updated Date': max})
  Item  Quantity        Updated Date
0    A        30 2019-07-26 10:56:00
1    B        70 2019-05-20 17:54:00

Is it possible to apply pd.to_datetime() only during the groupby? The challenge is that the datetime format is not guaranteed to be '%b %d, %Y %I:%M %p', and I want to keep each date string exactly as it is in the result.

Upvotes: 2

Views: 908

Answers (3)

Quang Hoang

Reputation: 150785

Converting to datetime and using idxmax is the way to go. You don't need to change your dataframe, for example:

# index labels of the rows holding the latest date for each Item
s = pd.to_datetime(df['Updated Date']).groupby(df['Item']).idxmax()

(df.groupby('Item')[['Quantity']].sum()        # get the sum of quantity
   .merge(df.loc[s, ['Item','Updated Date']],  # merge with the original rows for dates
          on='Item'                            # on the Item of course
          )
)

Output:

  Item  Quantity           Updated Date
0    A        30  Jul 26, 2019 10:56 AM
1    B        70  May 20, 2019 05:54 PM
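
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea with the intermediate steps named. The variable names (parsed, latest_idx, sums, latest) are only illustrative, and it assumes pd.to_datetime can parse every string in the column:

```python
import pandas as pd

df = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
                   ['A', 20, 'Jul 26, 2019 10:56 AM'],
                   ['B', 30, 'May 20, 2019 05:54 PM'],
                   ['B', 40, 'Apr 28, 2019 06:42 PM']],
                  columns=['Item', 'Quantity', 'Updated Date'])

# Parse into a separate Series; df['Updated Date'] itself stays a string column.
parsed = pd.to_datetime(df['Updated Date'])

# Index label of the latest parsed date within each Item group.
latest_idx = parsed.groupby(df['Item']).idxmax()

# Sum the quantities, then look up the original strings by label and merge.
sums = df.groupby('Item', as_index=False)['Quantity'].sum()
latest = df.loc[latest_idx, ['Item', 'Updated Date']]
result = sums.merge(latest, on='Item')
print(result)
```

The parsed dates are used only to decide which row wins, so whatever string that row held is returned untouched.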

Upvotes: 1

Dev Khadka

Reputation: 5461

You can do it with a lambda, as below:

df = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
                  ['A', 20, 'Jul 26, 2019 10:56 AM'],
                  ['B', 30, 'May 20, 2019 05:54 PM'],
                  ['B', 40, 'Apr 28, 2019 06:42 PM']],
                  columns=['Item', 'Quantity', 'Updated Date'])

# g is the 'Updated Date' Series of one group; parse only that group so idxmax returns a label inside it
df.groupby(['Item'], as_index=False).agg({'Quantity': sum, 'Updated Date': lambda g: g.loc[pd.to_datetime(g).idxmax()]})

Or, without a lambda, using a temporary datetime column:

df["Updated Date 2"] = pd.to_datetime(df["Updated Date"])
result = df.groupby(['Item'], as_index=False).agg({'Quantity': sum, 'Updated Date 2': "idxmax"})
result["Updated Date"] = df["Updated Date"].loc[result["Updated Date 2"]].values
result.drop(columns="Updated Date 2", inplace=True)
result
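
As a sanity check on the per-group variant, here is a small runnable sketch. It assumes the lambda receives each group's 'Updated Date' values as a Series (named g here) that keeps its original index labels, so parsing g itself and calling idxmax always yields a label that exists in that group. Parsing the whole column inside the lambda would instead return the label of the overall latest row, which may not belong to the current group.

```python
import pandas as pd

df = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
                   ['A', 20, 'Jul 26, 2019 10:56 AM'],
                   ['B', 30, 'May 20, 2019 05:54 PM'],
                   ['B', 40, 'Apr 28, 2019 06:42 PM']],
                  columns=['Item', 'Quantity', 'Updated Date'])

def latest_string(g):
    # g holds the date strings of a single group; compare them as datetimes
    # but return the original string of the winning row.
    return g.loc[pd.to_datetime(g).idxmax()]

result = df.groupby('Item', as_index=False).agg(
    {'Quantity': 'sum', 'Updated Date': latest_string})
print(result)
```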

Upvotes: 0

rafaelc

Reputation: 59274

Use the parsed datetime only as a reference for the comparison, and look the original string back up by index:

g = (df.assign(date=pd.to_datetime(df['Updated Date']))
    .groupby('Item')
    .agg({'Quantity': 'sum', 'date': 'idxmax'}))

g['Updated Date'] = df.loc[g.date, 'Updated Date'].tolist()

      Quantity           Updated Date
Item                                 
A           30  Jul 26, 2019 10:56 AM
B           70  May 20, 2019 05:54 PM
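
Since the question says the format is not guaranteed to be uniform, here is a hedged sketch that wraps the same idxmax lookup in a helper and parses each value on its own. The helper name sum_with_latest_string and the mixed sample data are hypothetical, and it assumes every individual string is something pd.to_datetime can parse by itself:

```python
import pandas as pd

def sum_with_latest_string(df, key, value_col, date_col):
    """Hypothetical helper: sum value_col per key and keep the original
    date string of the row whose parsed date is the latest."""
    # Parse element by element so rows may use different formats
    # (assumption: each string is parseable on its own).
    parsed = pd.Series([pd.to_datetime(x) for x in df[date_col]], index=df.index)
    latest_idx = parsed.groupby(df[key]).idxmax()
    out = df.groupby(key)[[value_col]].sum()
    # Both latest_idx and out are ordered by the sorted group keys.
    out[date_col] = df.loc[latest_idx, date_col].tolist()
    return out.reset_index()

# Made-up data with two different date formats in the same column.
mixed = pd.DataFrame([['A', 10, 'Jun 12, 2019 06:16 PM'],
                      ['A', 20, '2019-07-26 10:56'],
                      ['B', 30, 'May 20, 2019 05:54 PM'],
                      ['B', 40, '2019-04-28 18:42:00']],
                     columns=['Item', 'Quantity', 'Updated Date'])

result = sum_with_latest_string(mixed, 'Item', 'Quantity', 'Updated Date')
print(result)
```

Parsing row by row is slower than a single vectorized pd.to_datetime call, so it is probably only worth it when the formats really do vary.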

Upvotes: 2
