enrishi

Reputation: 3083

pandas - get most recent value of a particular column indexed by another column

I have the following dataframe:

   obj_id   data_date   value
0  4        2011-11-01  59500    
1  2        2011-10-01  35200 
2  4        2010-07-31  24860   
3  1        2009-07-28  15860
4  2        2008-10-15  200200
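
For reference, one way to reconstruct this frame (parsing data_date as datetime64 is my assumption; the post doesn't specify the dtype):

    import pandas as pd

    df = pd.DataFrame({
        'obj_id': [4, 2, 4, 1, 2],
        'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                     '2010-07-31', '2009-07-28',
                                     '2008-10-15']),
        'value': [59500, 35200, 24860, 15860, 200200],
    })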

I want to get a subset of this data so that I only have the most recent (largest 'data_date') 'value' for each 'obj_id'.

I've hacked together a solution, but it feels dirty. I was wondering if anyone has a better way. I'm sure I must be missing some easy way to do it through pandas.

My method is essentially to group, sort, retrieve, and recombine as follows:

row_arr = []
for grp, grp_df in df.groupby('obj_id'):
    # keep only the top row after sorting each group by date, newest first
    row_arr.append(grp_df.sort_values('data_date', ascending=False)[:1].values[0])

df_new = pd.DataFrame(row_arr, columns=('obj_id', 'data_date', 'value'))

Upvotes: 28

Views: 25456

Answers (7)

Soudipta Dutta

Reputation: 2162

import pandas as pd

df1 = pd.DataFrame({
    'Id': ['00', '01', '02', '02', '01', '03'],
    'date': ['1990-12-31', '1990-12-27', '1990-12-28',
             '1990-12-28', '1992-12-27', '1990-12-30'],
    'Population': ['700', '200', '300', '400', '500', '100']
})
print(df1)

"""
   Id         date Population
0  00  1990-12-31         700
1  01  1990-12-27         200
2  02  1990-12-28         300
3  02  1990-12-28         400
4  01  1992-12-27         500
5  03  1990-12-30         100
"""



Max1 = df1.groupby('Id').apply( lambda df : df['Population'].values[df['Population'].values.argmax()]  )


print(Max1)

"""
Id
00    700
01    500
02    400
03    100
dtype: object
"""

Min1 = df1.groupby('Id').apply(lambda g: g['Population'].values[g['Population'].values.argmin()])

print(Min1)

"""
Id
00    700
01    200
02    300
03    100
dtype: object
"""

METHOD 2:

cc = df1.sort_values('Population', ascending=False).drop_duplicates(['Id'])
print(cc)

"""
   Id         date Population
0  00  1990-12-31         700
4  01  1992-12-27         500
3  02  1990-12-28         400
5  03  1990-12-30         100
"""

METHOD 3:

aa = df1.groupby('Id', sort=False)['Population'].max()
print(aa)
"""
Id
00    700
01    500
02    400
03    100
Name: Population, dtype: object
"""

METHOD 4:

res = df1.groupby('Id')['Population'].transform('max') == df1['Population']

print(df1[res])

"""
   Id         date Population
0  00  1990-12-31         700
3  02  1990-12-28         400
4  01  1992-12-27         500
5  03  1990-12-30         100
"""

Upvotes: 1

pdifranc

Reputation: 181

This is another possible solution. I don't know if it's the fastest (I doubt it), since I haven't benchmarked it against the other approaches.

df.loc[df.groupby('obj_id').data_date.idxmax(),:]
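
One caveat (mine, not from the answer): idxmax needs an orderable dtype it can reduce over, so if data_date is still held as plain strings this can raise a TypeError in some pandas versions. Converting first is the safe route:

    # Sketch: make idxmax well-defined by parsing the dates first.
    df['data_date'] = pd.to_datetime(df['data_date'])
    latest = df.loc[df.groupby('obj_id')['data_date'].idxmax()]
    print(latest)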

Upvotes: 18

Tamelise

Reputation: 41

Updating thetainted1's answer, since some of the functions now raise future warnings, as tommy.carstensen pointed out. Here's what worked for me:

df_sorted = df.sort_values(by='data_date')

result = df_sorted.drop_duplicates('obj_id', keep='last')
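
Applied to the question's frame, this should yield something like:

    print(result)
    #    obj_id   data_date  value
    # 3       1  2009-07-28  15860
    # 1       2  2011-10-01  35200
    # 0       4  2011-11-01  59500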

Upvotes: 4

thetainted1

Reputation: 471

If the number of "obj_id" values is very high, you'll want to sort the entire dataframe and then drop duplicates to get the last element.

sorted = df.sort_index(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last').values

This should be faster (sorry, I didn't test it) because you don't have to do a custom agg function, which is slow when there is a large number of keys. You might think it's worse to sort the entire dataframe, but in practice Python sorts are fast and native loops are slow.
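
A rough way to check that claim (my sketch, not from the answer; timings will vary with pandas version and data shape):

    import timeit

    import numpy as np
    import pandas as pd

    # Synthetic frame (my assumption): many keys, a few rows per key.
    n = 200_000
    rng = np.random.default_rng(0)
    big = pd.DataFrame({
        'obj_id': rng.integers(0, 50_000, n),
        'data_date': pd.Timestamp('2000-01-01')
                     + pd.to_timedelta(rng.integers(0, 4_000, n), unit='D'),
        'value': rng.integers(0, 1_000_000, n),
    })

    # Sort once, then keep the last (most recent) row per obj_id.
    def sort_then_drop():
        return big.sort_values('data_date').drop_duplicates('obj_id', keep='last')

    # Per-group work, as in the question's loop.
    def group_apply():
        return big.groupby('obj_id').apply(lambda g: g.loc[g['data_date'].idxmax()])

    print(timeit.timeit(sort_then_drop, number=5))
    print(timeit.timeit(group_apply, number=5))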

Upvotes: 23

Zihs

Reputation: 347

I believe I've found a more appropriate solution, based on the ones in this thread. Mine uses a groupby's apply instead of aggregate, and returns the most recent DATE for each CARD_NO.

df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39', '2006-12-27 20:11:53', '2006-12-28 20:12:11',
             '2006-12-28 20:12:13', '2008-12-27 20:11:53', '2006-12-30 20:11:39']})

print(df)
df.groupby('CARD_NO').apply(lambda g: g['DATE'].values[g['DATE'].values.argmax()])

Original

CARD_NO                 DATE
0     000  2006-12-31 20:11:39
1     001  2006-12-27 20:11:53
2     002  2006-12-28 20:12:11
3     002  2006-12-28 20:12:13
4     001  2008-12-27 20:11:53
5     111  2006-12-30 20:11:39

Returned Series:

CARD_NO
000        2006-12-31 20:11:39
001        2008-12-27 20:11:53
002        2006-12-28 20:12:13
111        2006-12-30 20:11:39

Upvotes: 0

Garrett

Reputation: 49886

The aggregate() method on groupby objects can build a new DataFrame in a single step. (I'm not aware of a cleaner way to extract the first/last row of a DataFrame, though.)

In [12]: df.groupby('obj_id').agg(lambda df: df.sort('data_date')[-1:].values[0])
Out[12]: 
         data_date  value
obj_id                   
1       2009-07-28  15860
2       2011-10-01  35200
4       2011-11-01  59500

You can also perform aggregation on individual columns, in which case the aggregate function works on a Series object.

In [25]: df.groupby('obj_id')['value'].agg({'diff': lambda s: s.max() - s.min()})
Out[25]: 
          diff
obj_id        
1            0
2       165000
4        34640
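
Note (mine, not Garrett's): both sort() and the dict-renaming form of agg() were removed in later pandas versions. Rough modern equivalents, assuming data_date is a sortable dtype:

    # In [12] equivalent: sort once, then take each group's last row.
    df.sort_values('data_date').groupby('obj_id').last()

    # In [25] equivalent: named aggregation (pandas >= 0.25).
    df.groupby('obj_id')['value'].agg(diff=lambda s: s.max() - s.min())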

Upvotes: 2

Maximilian

Reputation: 516

I like crewbum's answer; this is probably faster (sorry, I haven't tested it yet, but it avoids sorting everything):

df.groupby('obj_id').agg(lambda g: g.values[g['data_date'].values.argmax()])

It uses NumPy's argmax function to find the row index at which the maximum appears.
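
A note on newer pandas (my addition): agg() now passes each column to the function separately, so the frame-level lambda above no longer runs as written; apply is the closer modern spelling:

    # Same idea via apply: keep each group's row at the argmax of data_date.
    df.groupby('obj_id').apply(lambda g: g.loc[g['data_date'].idxmax()])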

Upvotes: 5
