pythonRcpp
pythonRcpp

Reputation: 2146

Take max from multiple grouped data pandas

I am in a loop which gives me a groupby output like below df.groupby(['grp1','grp2'])['mycol'].sum()

Basically I am getting sum of my grouped elements.

grp1  grp2 
A     1    10 
B     1    20
C     2    30 
D     3    40 
E     4    50 
      1    60 

Now in the next iteration I may get a grouped df like below

grp1  grp2 
A     1    20 
D     3    40 
E     4    30 
      1    90 
F     1    40

I want to take the max from each iteration. So after 2nd iteration I have an output like

grp1  grp2 
A     1    20 #because 20 was higher than 10
B     1    20 #carried as it is
C     2    30 #carried as it is
D     3    40 #carried as it is (both were equal)
E     4    30 #because 90+30 >50+60
      1    90 
F     1    40 #added

So by the end I have which group reached peak values during say 5 iterations. It sounds straight forward ( to keep track of max seen till now), but I am not getting how to approach this. I tried doing df.groupby(['grp1','grp2'])['mycol'].sum().to_dict() and do something like updating dict on reading new df. (just a try, not sure how to keep dict updated) or maybe there is a simple pandas and np solution I still dont know.

Upvotes: 1

Views: 50

Answers (1)

thorbjornwolf
thorbjornwolf

Reputation: 1848

You can use groupby again! :-)

  1. Concatenate your results (pd.concat),
  2. Group by the index (pd.Series.groupby, see level param),
  3. Aggregate with .max()

Shown in code:

res = pd.concat(list_of_iteration_results)
res = res.groupby(level=['grp1', 'grp2']).max()

Upvotes: 2

Related Questions