Reputation: 15
I am working with a very large dataframe (3.5 million X 150 and takes 25 gigs of memory when unpickled) and I need to find maximum of one column over an id number and a date and keep only the row with the maximum value. Each row is a recorded observation for one id at a certain date and I also need the latest date.
This is animal test data where there are twenty additional columns seg1-seg20 for each id and date that are filled with test day information consecutively, for example, first test data fills seg1, second test data fills seg2 ect. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want these rows and not the previous rows. For example:
df= DataFrame({'id':[1000,1000,1001,2000,2000,2000],
"date":[20010101,20010201,20010115,20010203,20010223,20010220],
"value":[3,1,4,2,6,6],
"seg1":[22,76,23,45,12,53],
"seg2":[23,"",34,52,24,45],
"seg3":[90,"",32,"",34,54],
"seg4":["","",32,"",43,12],
"seg5":["","","","",43,21],
"seg6":["","","","",43,24]})
df
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
1 20010201 1000 76 1
2 20010115 1001 23 34 32 32 4
3 20010203 2000 45 52 2
4 20010223 2000 12 24 34 43 43 41 6
5 20010220 2000 12 24 34 43 44 35 6
And eventually it should be:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 41 6
I first tried to use .groupby('id').max but couldnt find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS and not just the maximum value of each column with each id. My current solution is:
for i in df.id.unique():
df =df.drop(df.loc[df.id==i].sort(['value','date']).index[:-1])
But this takes around 10 seconds to run each time through, I assume because its trying to call up the entire dataframe each time through. There are 760,000 unique ids, each are 17 digits long, so it will take way too long to be feasible at this rate.
Is there another method that would be more efficient? Currently it reads every column in as an "object" but converting relevant columns to the lowest possible bit of integer doesnt seem to help either.
Upvotes: 0
Views: 3265
Reputation: 264
I tried with groupby('id').max()
and it works, and it also drop the rows. Did you remeber to reassign the df
variable? Because this operation (and almost all Pandas' operations) are not in-place.
If you do:
df.groupby('id', sort = False).max()
You will get:
date value
id
1000 20010201 3
1001 20010115 4
2000 20010223 6
And if you don't want id
as the index, you do:
df.groupby('id', sort = False, as_index = False).max()
And you will get:
id date value
0 1000 20010201 3
1 1001 20010115 4
2 2000 20010223 6
I don't know if that's going to be much faster, though.
This way the index will not be reseted:
df.iloc[df.groupby('id').apply(lambda x: x['value'].idxmax())]
And you will get:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6
Upvotes: 1