midtownguru
midtownguru

Reputation: 2723

Python pandas: select 2nd smallest value in groupby

I have an example DataFrame like the following:

import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,2,2,2,3,3,], 'date':array(['2000-01-01','2002-01-01','2010-01-01','2003-01-01','2004-01-01','2008-01-01'],dtype='datetime64[D]')})

I am trying to get the 2nd earliest day in each ID group. So I wrote the following funciton:

def f(x):
    if len(x)==1:
        return x[0]
    else:
        x.sort()
        return x[1]

And then I wrote:

df.groupby('ID').date.apply(lambda x:f(x))

The result is an error.

Could you find a way to make this work?

Upvotes: 6

Views: 10887

Answers (3)

scottlittle
scottlittle

Reputation: 20822

You may not want to return the first and only value as the second value as in the accepted answer (i.e., 2000-01-01 is not the second value, but the only value). If this is the case, you can rank each group and be able to select the first, second, third, etc. smallest value more generically:

df['rank'] = df.sort_values('date').groupby('ID').cumcount()+1

For the second smallest value:

df[df['rank'] == 2]

this returns

ID  date        rank
2   2003-01-01  2
3   2008-01-01  2

Upvotes: 1

Jeff
Jeff

Reputation: 128948

This requires 0.14.1. And will be quite efficient, especially if you have large groups (as this doesn't require fully sorting them).

In [32]: df.groupby('ID')['date'].nsmallest(2)
Out[32]: 
ID   
1   0   2000-01-01
2   1   2002-01-01
    3   2003-01-01
3   4   2004-01-01
    5   2008-01-01
dtype: datetime64[ns]

In [33]: df.groupby('ID')['date'].nsmallest(2).groupby(level='ID').last()
Out[33]: 
ID
1    2000-01-01
2    2003-01-01
3    2008-01-01
dtype: datetime64[ns]

Upvotes: 7

chrisb
chrisb

Reputation: 52236

Take a look at the indexing docs - in general pandas defaults to indexing by label, rather than location - that's why you get a KeyError.

In your particular case you could use .iloc for location based indexing.

In [266]: def f(x):
     ...:     if len(x)==1:
     ...:         return x.iloc[0]
     ...:     else:
     ...:         x.sort()
     ...:         return x.iloc[1]
     ...:     

In [267]: df.groupby('ID').date.apply(f)
Out[267]: 
ID
1    2000-01-01
2    2003-01-01
3    2008-01-01
Name: date, dtype: datetime64[ns]

Upvotes: 3

Related Questions