jhutch

Reputation: 15

Efficiently find the maximum of one column per id in a Pandas dataframe

I am working with a very large dataframe (3.5 million x 150; it takes 25 GB of memory when unpickled). For each id number I need to find the maximum of one column and keep only the row with that maximum value. Each row is a recorded observation for one id on a certain date, and I also need the latest date.

This is animal test data with twenty additional columns, seg1-seg20, for each id and date. They are filled with test-day information consecutively: the first test's data fills seg1, the second test's data fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want those rows, not the earlier ones. For example:

from pandas import DataFrame

df = DataFrame({'id': [1000, 1000, 1001, 2000, 2000, 2000],
                'date': [20010101, 20010201, 20010115, 20010203, 20010223, 20010220],
                'value': [3, 1, 4, 2, 6, 6],
                'seg1': [22, 76, 23, 45, 12, 53],
                'seg2': [23, "", 34, 52, 24, 45],
                'seg3': [90, "", 32, "", 34, 54],
                'seg4': ["", "", 32, "", 43, 12],
                'seg5': ["", "", "", "", 43, 21],
                'seg6': ["", "", "", "", 43, 24]})
df
       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
1  20010201  1000    76                               1
2  20010115  1001    23   34   32   32                4
3  20010203  2000    45   52                          2
4  20010223  2000    12   24   34   43   43   41      6
5  20010220  2000    12   24   34   43   44   35      6

And eventually it should be:

       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
2  20010115  1001    23   34   32   32                4
4  20010223  2000    12   24   34   43   43   41      6

I first tried to use .groupby('id').max() but couldn't find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS and not just the maximum value of each column for each id. My current solution is:

# For each id, sort its rows by value and date, then drop all but the last one,
# i.e. keep only the row with the highest value and latest date.
for i in df.id.unique():
    df = df.drop(df.loc[df.id == i].sort_values(['value', 'date']).index[:-1])

But this takes around 10 seconds per iteration of the loop, I assume because it scans the entire dataframe each time through. There are 760,000 unique ids, each 17 digits long, so at this rate it will take far too long to be feasible.

Is there another method that would be more efficient? Currently every column is read in as an "object" dtype, but converting the relevant columns to the smallest possible integer dtype doesn't seem to help either.
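
(For reference, the dtype conversion I mean is roughly the sketch below; the column list is just an example and may differ from what I actually ran:)

import pandas as pd

# Try to downcast the object columns to the smallest numeric dtype;
# empty strings become NaN, which forces those columns to stay float.
for col in ['value'] + ['seg%d' % n for n in range(1, 7)]:
    df[col] = pd.to_numeric(df[col], errors='coerce', downcast='integer')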

Upvotes: 0

Views: 3265

Answers (1)

Roberto Soares

Reputation: 264

I tried with groupby('id').max() and it works, and it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all Pandas operations) is not in-place.
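
For example (a minimal illustration of the reassignment, not specific to your data):

df.groupby('id').max()        # returns a new DataFrame; df itself is unchanged
df = df.groupby('id').max()   # rebind df to keep the reduced result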

If you do:

df.groupby('id', sort=False).max()

You will get:

          date  value
id                   
1000  20010201      3
1001  20010115      4
2000  20010223      6

And if you don't want id as the index, you do:

df.groupby('id', sort=False, as_index=False).max()

And you will get:

     id      date  value
0  1000  20010201      3
1  1001  20010115      4
2  2000  20010223      6

I don't know if that's going to be much faster, though.

Update

This way the index will not be reset:

# idxmax returns index labels, so select them with .loc rather than .iloc
df.loc[df.groupby('id').apply(lambda x: x['value'].idxmax())]

And you will get:

       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
2  20010115  1001    23   34   32   32                4
4  20010223  2000    12   24   34   43   43   43      6
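
If ties on value should be broken by the latest date (the question also needs the latest date), a vectorized sort-then-dedupe sketch like the one below would keep the original rows as well; the exact sort order is an assumption based on your example data (YYYYMMDD integers sort correctly):

# Within each id, put the row with the highest value (and, on ties, the
# latest date) last, then keep only that last row per id.
result = (df.sort_values(['id', 'value', 'date'])
            .drop_duplicates(subset='id', keep='last'))

This avoids the per-id Python loop entirely, which should matter with 760,000 ids.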

Upvotes: 1
