Clairton Menezes

Reputation: 85

How to keep non-grouped columns in a groupby in Pandas

I have a pandas dataframe that I want to group by email, taking the max value of date for each group while keeping the status column. The status column is not part of the groupby.

Example: given the following dataframe df

+---------+------------+----------+
|  email  |   status   |   date   |
+---------+------------+----------+
|  test1  |   viewed   | 01/07/18 |
|  test1  | not viewed | 03/07/18 |
|  test2  | not viewed | 02/07/18 |
|  test2  |   viewed   | 01/07/18 |
|  test3  | not viewed | 03/07/18 |
|  test3  |   viewed   | 04/07/18 |
+---------+------------+----------+

I use the following code, which returns the max date per email but drops the status column, and I don't know how to keep it.

df.groupby(['email']).aggregate({'date': 'max'})

Desired output:

+---------+------------+----------+
|  email  |   status   |   date   |
+---------+------------+----------+
|  test1  | not viewed | 03/07/18 |
|  test2  | not viewed | 02/07/18 |
|  test3  |   viewed   | 04/07/18 |
+---------+------------+----------+

In short, I want to group by email, get the most recent date, and keep the status from that same row.

Upvotes: 3

Views: 146

Answers (3)

BENY

Reputation: 323326

You can use drop_duplicates: sort by date, then keep only the last (most recent) row for each email.

#df['date'] = pd.to_datetime(df.date)

df.sort_values('date').drop_duplicates(['email'],keep='last')
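
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the question's sample data and applies the same sort-then-deduplicate idea. It assumes the dates are day/month/year, which the question does not state; the `latest` variable name and the final `sort_values("email")` are only there for illustration.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "email": ["test1", "test1", "test2", "test2", "test3", "test3"],
    "status": ["viewed", "not viewed", "not viewed",
               "viewed", "not viewed", "viewed"],
    "date": ["01/07/18", "03/07/18", "02/07/18",
             "01/07/18", "03/07/18", "04/07/18"],
})

# Assumption: day/month/year. Change the format if the data is month-first.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%y")

latest = (
    df.sort_values("date")                    # oldest first, newest last
      .drop_duplicates(["email"], keep="last")  # keep the newest row per email
      .sort_values("email")                   # cosmetic: order rows by email
      .reset_index(drop=True)
)
print(latest)
```

Because whole rows are kept, status always comes from the same row as the most recent date.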

Upvotes: 2

sacuL

Reputation: 51395

Instead of agg, you could sort by date, group by email, and select the last row of each group (which will be the most recent):

df['date'] = pd.to_datetime(df.date)

df.sort_values('date').groupby('email', as_index=False).last()

   email      status       date
0  test1  not viewed 2018-03-07
1  test2  not viewed 2018-02-07
2  test3      viewed 2018-04-07
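
One caveat worth knowing: `GroupBy.last()` picks the last non-null value of each column separately, so if status can be missing, the result may combine values from different rows. The sketch below uses made-up data with a missing status to show the difference against `GroupBy.tail(1)`, which keeps whole rows; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: the most recent row for test1 has no status.
df = pd.DataFrame({
    "email": ["test1", "test1"],
    "status": ["viewed", None],
    "date": pd.to_datetime(["2018-07-01", "2018-07-03"]),
})

ordered = df.sort_values("date")

# last() skips nulls per column: status comes from the older row,
# while date comes from the newer one.
per_column = ordered.groupby("email", as_index=False).last()

# tail(1) returns the last row of each group unchanged, nulls included.
whole_row = ordered.groupby("email").tail(1)

print(per_column)
print(whole_row)
```

If the columns have no missing values, as in the question's data, both give the same rows.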

Upvotes: 3

ALollz

Reputation: 59579

You can find the index of the maximum date for each group and then subset the original DataFrame.

import pandas as pd
df['date'] = pd.to_datetime(df.date)

df.loc[df.groupby('email').date.idxmax()]

Output:

   email      status       date
1  test1  not viewed 2018-03-07
2  test2  not viewed 2018-02-07
5  test3      viewed 2018-04-07
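
As a self-contained sketch of the same idea on the question's sample data (assuming day/month/year dates, which the question does not state): `idxmax` returns index labels, so this relies on the DataFrame having a unique index, and on ties it returns the first row that reaches the maximum. The names `idx` and `result` are illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question (default 0..5 index).
df = pd.DataFrame({
    "email": ["test1", "test1", "test2", "test2", "test3", "test3"],
    "status": ["viewed", "not viewed", "not viewed",
               "viewed", "not viewed", "viewed"],
    "date": ["01/07/18", "03/07/18", "02/07/18",
             "01/07/18", "03/07/18", "04/07/18"],
})

# Assumption: day/month/year. Change the format if the data is month-first.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%y")

# Index label of the most recent date within each email group.
idx = df.groupby("email")["date"].idxmax()

# Select those rows from the original frame; status comes along untouched.
result = df.loc[idx]
print(result)
```

Unlike the sort-based approaches, this does not reorder the whole frame; it only looks up one row per group.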

Upvotes: 1
