Reputation: 2723
I have an example DataFrame like the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,2,2,2,3,3,], 'date':array(['2000-01-01','2002-01-01','2010-01-01','2003-01-01','2004-01-01','2008-01-01'],dtype='datetime64[D]')})
I am trying to get the 2nd earliest day in each ID group. So I wrote the following funciton:
def f(x):
if len(x)==1:
return x[0]
else:
x.sort()
return x[1]
And then I wrote:
df.groupby('ID').date.apply(lambda x:f(x))
The result is an error.
Could you find a way to make this work?
Upvotes: 6
Views: 10887
Reputation: 20822
You may not want to return the first and only value as the second value as in the accepted answer (i.e., 2000-01-01
is not the second value, but the only value). If this is the case, you can rank each group and be able to select the first, second, third, etc. smallest value more generically:
df['rank'] = df.sort_values('date').groupby('ID').cumcount()+1
For the second smallest value:
df[df['rank'] == 2]
this returns
ID date rank
2 2003-01-01 2
3 2008-01-01 2
Upvotes: 1
Reputation: 128948
This requires 0.14.1. And will be quite efficient, especially if you have large groups (as this doesn't require fully sorting them).
In [32]: df.groupby('ID')['date'].nsmallest(2)
Out[32]:
ID
1 0 2000-01-01
2 1 2002-01-01
3 2003-01-01
3 4 2004-01-01
5 2008-01-01
dtype: datetime64[ns]
In [33]: df.groupby('ID')['date'].nsmallest(2).groupby(level='ID').last()
Out[33]:
ID
1 2000-01-01
2 2003-01-01
3 2008-01-01
dtype: datetime64[ns]
Upvotes: 7
Reputation: 52236
Take a look at the indexing docs - in general pandas defaults to indexing by label, rather than location - that's why you get a KeyError
.
In your particular case you could use .iloc
for location based indexing.
In [266]: def f(x):
...: if len(x)==1:
...: return x.iloc[0]
...: else:
...: x.sort()
...: return x.iloc[1]
...:
In [267]: df.groupby('ID').date.apply(f)
Out[267]:
ID
1 2000-01-01
2 2003-01-01
3 2008-01-01
Name: date, dtype: datetime64[ns]
Upvotes: 3