Complex Subset of Pandas Dataframe

Question

I have a dataframe in which one column contains repeated values. I want to keep only the 3 most recent entries for each unique value in that column.

I have:

      Group      Date        Value
        A        10/1/2017    4.4
        A        8/3/2017     9.2
        A        5/3/2017     8.4
        A        4/4/2017     4.4
        B        12/1/2015    4.4
        B        8/3/2015     9.2
        B        5/4/2015     8.4
        B        4/5/2015     4.4

and I want:

      Group      Date        Value
        A        10/1/2017    4.4
        A        8/3/2017     9.2
        A        5/3/2017     8.4
        B        12/1/2015    4.4
        B        8/3/2015     9.2
        B        5/4/2015     8.4

The dates are datetimes; I didn't know how else to represent them in this post.

I can get this with the following code:

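import pandas as pd

new_df = pd.DataFrame()
for group in df['Group'].unique():
    # Rows for this group (df is already ordered newest-first within each group)
    temp_df = df[df['Group'] == group]
    # Keep the first 3 rows, i.e. the 3 most recent entries
    temp_df = temp_df[0:3]
    if new_df.empty:
        new_df = temp_df
    else:
        # Reassign the result: concatenation returns a new frame
        new_df = pd.concat([new_df, temp_df])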

Is there a more pythonic way to do this?

Thanks in advance.

rje · Accepted Answer

What about the following:

df.sort_values(by='Date', ascending=False).groupby('Group').head(3)

On my small test dataset this returns the following (I used head(2) here):

    Group   Date
1   c   2050-01-01
8   a   2032-02-03
0   a   2030-01-01
9   c   2029-01-01
10  b   2018-01-01
2   b   2017-02-03

As you can see, the rows for each group are no longer kept together. We can fix this by sorting first on 'Group' and then on 'Date':

df.sort_values(by=['Group', 'Date'], ascending=[True,False]).groupby('Group').head(3)


    Group   Date
8   a   2032-02-03
0   a   2030-01-01
10  b   2018-01-01
2   b   2017-02-03
1   c   2050-01-01
9   c   2029-01-01
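
To tie this back to the question, here is a minimal sketch that applies the same sort-then-head approach to the sample data from the post. The frame is rebuilt by hand, and reading the dates as month/day/year is an assumption on my part, since the question does not say which format they are in.

```python
import pandas as pd

# Sample data copied from the question.
# Assumption: the dates are written as month/day/year.
df = pd.DataFrame({
    "Group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Date": pd.to_datetime(
        ["10/1/2017", "8/3/2017", "5/3/2017", "4/4/2017",
         "12/1/2015", "8/3/2015", "5/4/2015", "4/5/2015"],
        format="%m/%d/%Y",
    ),
    "Value": [4.4, 9.2, 8.4, 4.4, 4.4, 9.2, 8.4, 4.4],
})

# Sort by group, then newest date first, and keep the first 3 rows per group.
latest = (
    df.sort_values(by=["Group", "Date"], ascending=[True, False])
      .groupby("Group")
      .head(3)
)
print(latest)
```

Under that date assumption, the oldest row of each group (4/4/2017 for A and 4/5/2015 for B) is dropped and the original index labels are kept. Add .reset_index(drop=True) at the end if you want a fresh index.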
