Reputation: 3083
I have the following dataframe:
   obj_id   data_date   value
0       4  2011-11-01   59500
1       2  2011-10-01   35200
2       4  2010-07-31   24860
3       1  2009-07-28   15860
4       2  2008-10-15  200200
I want to get a subset of this data so that I only have the most recent (largest 'data_date') 'value' for each 'obj_id'.
I've hacked together a solution, but it feels dirty. I was wondering if anyone has a better way. I'm sure I must be missing some easy way to do it through pandas.
My method is essentially to group, sort, retrieve, and recombine as follows:
from pandas import DataFrame

row_arr = []
for grp, grp_df in df.groupby('obj_id'):
    # keep only the row with the largest data_date in each group
    row_arr.append(grp_df.sort_values('data_date', ascending=False)[:1].values[0])
df_new = DataFrame(row_arr, columns=('obj_id', 'data_date', 'value'))
Upvotes: 28
Views: 25456
Reputation: 2162
import pandas as pd

df1 = pd.DataFrame({
    'Id': ['00', '01', '02', '02', '01', '03'],
    'date': ['1990-12-31', '1990-12-27', '1990-12-28',
             '1990-12-28', '1992-12-27', '1990-12-30'],
    'Population': ['700', '200', '300', '400', '500', '100']
})
print(df1)
"""
Id date Population
0 00 1990-12-31 700
1 01 1990-12-27 200
2 02 1990-12-28 300
3 02 1990-12-28 400
4 01 1992-12-27 500
5 03 1990-12-30 100
"""
METHOD 1:
Max1 = df1.groupby('Id').apply(lambda df: df['Population'].values[df['Population'].values.argmax()])
print(Max1)
"""
Id
00 700
01 500
02 400
03 100
dtype: object
"""
Min1 = df1.groupby('Id').apply(lambda df: df['Population'].values[df['Population'].values.argmin()])
print(Min1)
"""
Id
00 700
01 200
02 300
03 100
dtype: object
"""
METHOD 2:
cc = df1.sort_values('Population', ascending=False).drop_duplicates(['Id'])
print(cc)
"""
Id date Population
0 00 1990-12-31 700
4 01 1992-12-27 500
3 02 1990-12-28 400
5 03 1990-12-30 100
"""
METHOD 3:
aa = df1.groupby(['Id'], sort=False)['Population'].max()
print(aa)
"""
Id
00 700
01 500
02 400
03 100
Name: Population, dtype: object
"""
METHOD 4:
res = df1.groupby(['Id'])['Population'].transform('max') == df1['Population']
print(df1[res])
"""
Id date Population
0 00 1990-12-31 700
3 02 1990-12-28 400
4 01 1992-12-27 500
5 03 1990-12-30 100
"""
Upvotes: 1
Reputation: 181
This is another possible solution. I don't know if it's the fastest (I doubt it), since I haven't benchmarked it against the other approaches.
df.loc[df.groupby('obj_id').data_date.idxmax(),:]
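For reference, a self-contained sketch of this approach on the sample data from the question (parsing data_date to datetime is an assumption on my part; ISO-formatted date strings would compare the same way):
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': ['2011-11-01', '2011-10-01', '2010-07-31', '2009-07-28', '2008-10-15'],
    'value': [59500, 35200, 24860, 15860, 200200],
})
df['data_date'] = pd.to_datetime(df['data_date'])  # make the column a real date

# idxmax returns the index label of the latest data_date within each group;
# on ties it picks the first occurrence
print(df.loc[df.groupby('obj_id')['data_date'].idxmax(), :])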
Upvotes: 18
Reputation: 41
Updating thetainted1's answer, since some of the functions now raise future warnings, as tommy.carstensen pointed out. Here's what worked for me:
sorted = df.sort_values(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last')
Upvotes: 4
Reputation: 471
If the number of "obj_id"s is very high you'll want to sort the entire dataframe and then drop duplicates to get the last element.
sorted = df.sort_index(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last').values
This should be faster (sorry, I didn't test it) because you don't have to do a custom agg function, which is slow when there is a large number of keys. You might think it's worse to sort the entire dataframe, but in practice, in Python, sorts are fast and native loops are slow.
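To make that performance claim concrete, here is a rough benchmark sketch; the data sizes are illustrative assumptions, and sort_values/drop_duplicates are the modern spellings of the calls above:
import timeit

import numpy as np
import pandas as pd

# illustrative data: 100,000 rows spread over 10,000 obj_ids
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'obj_id': rng.integers(0, 10_000, 100_000),
    'data_date': pd.Timestamp('2000-01-01')
                 + pd.to_timedelta(rng.integers(0, 5_000, 100_000), unit='D'),
    'value': rng.integers(0, 1_000_000, 100_000),
})

def by_sort(df):
    # one global sort, then keep the last (latest) row per obj_id
    return df.sort_values('data_date').drop_duplicates('obj_id', keep='last')

def by_loop(df):
    # per-group Python loop, as in the question
    rows = [g.sort_values('data_date').iloc[-1] for _, g in df.groupby('obj_id')]
    return pd.DataFrame(rows)

print(timeit.timeit(lambda: by_sort(big), number=5))
print(timeit.timeit(lambda: by_loop(big), number=5))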
Upvotes: 23
Reputation: 347
I believe I've found a more appropriate solution based on the ones in this thread. However, mine uses a dataframe's apply function instead of aggregate, and it returns the latest DATE for each CARD_NO (see the sketch after the output for a variant that returns whole rows with the same columns as the original).
import pandas as pd

df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39', '2006-12-27 20:11:53', '2006-12-28 20:12:11',
             '2006-12-28 20:12:13', '2008-12-27 20:11:53', '2006-12-30 20:11:39']})
print(df)
df.groupby('CARD_NO').apply(lambda df: df['DATE'].values[df['DATE'].values.argmax()])
Original
CARD_NO DATE
0 000 2006-12-31 20:11:39
1 001 2006-12-27 20:11:53
2 002 2006-12-28 20:12:11
3 002 2006-12-28 20:12:13
4 001 2008-12-27 20:11:53
5 111 2006-12-30 20:11:39
Returned (a Series with the latest DATE per CARD_NO):
CARD_NO
000 2006-12-31 20:11:39
001 2008-12-27 20:11:53
002 2006-12-28 20:12:13
111 2006-12-30 20:11:39
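If you want whole rows back (all of the original columns), a small variant of the same argmax idea works; the switch to apply with iloc is my substitution, not part of the original answer:
# one full row per CARD_NO: the one with the latest DATE
# (ISO-formatted DATE strings, so string argmax matches date order)
latest_rows = df.groupby('CARD_NO').apply(lambda g: g.iloc[g['DATE'].values.argmax()])
print(latest_rows)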
Upvotes: 0
Reputation: 49886
The aggregate() method on groupby objects can be used to create a new DataFrame from a groupby object in a single step. (I'm not aware of a cleaner way to extract the first/last row of a DataFrame though.)
In [12]: df.groupby('obj_id').agg(lambda df: df.sort_values('data_date')[-1:].values[0])
Out[12]:
data_date value
obj_id
1 2009-07-28 15860
2 2011-10-01 35200
4 2011-11-01 59500
You can also perform aggregation on individual columns, in which case the aggregate function works on a Series object.
In [25]: df.groupby('obj_id')['value'].agg(diff=lambda s: s.max() - s.min())
Out[25]:
diff
obj_id
1 0
2 165000
4 34640
Upvotes: 2
Reputation: 516
I like crewbum's answer; this is probably faster (sorry, I didn't test it yet, but I avoid sorting everything):
df.groupby('obj_id').apply(lambda df: df.iloc[df['data_date'].values.argmax()])
It uses NumPy's argmax function to find the index of the row in which the maximum appears.
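One tie-breaking detail worth knowing, shown below: argmax (like idxmax) returns the first maximum it finds, so rows with duplicate maximal dates resolve to the first match, not the last.
import numpy as np

dates = np.array(['2011-11-01', '2011-11-01', '2010-07-31'])
print(dates.argmax())  # 0 -- the first of the tied maxima wins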
Upvotes: 5