nicoco
nicoco

Reputation: 1563

Finding rows with the maximum values within a group

I have this dataframe which entries are:

In [77]: df.loc[1]
Out[77]: 
img            410T1_B03_S06_W2_cell1_ann.tif
immean                                   1302
imvar                                  101773
imrange                                  2640
imtotalvar                              63321
method                            maxminratio
thresh                                   1.01
cont                                        8
minz                                        2
cent                                       50
zs                                          1
localmax                                    F
sha                                      1383
tp                                          3
fp                                         44
fn                                          0
time                                    139.4
precision                           0.0638298
sensitivity                                 1
score                                    0.12
Name: 1, dtype: object

with dtypes:

In [79]: df.dtypes
Out[79]: 
img            category
immean            int64
imvar             int64
imrange           int64
imtotalvar        int64
method         category
thresh          float64
cont           category
minz           category
cent           category
zs             category
localmax       category
sha            category
tp              float64
fp              float64
fn              float64
time            float64
precision       float64
sensitivity     float64
score           float64
dtype: object

I need to group them by 'method', then subgroup them by same sets of ['cont','minz','cent','zs','localmax','sha'], and once this is done, get the row that has the best 'score' (while preserving other infos).

I've did some really dirty code that works that does the job, but I'd like to get it done the right pandasic way:

def multiplecond(df,listvar,listvalues):
  res = df[listvar[0]]==listvalues[0]
  for var,val in zip(listvar[1:],listvalues[1:]):
    res = (res) & (df[var]==val)
  return res

falseparams = ['cont','minz','cent','zs','localmax','sha']
falseparamsvalues = [list(df[x].values.categories) for x in falseparams]
falseparamssets = [[a,b,c,d,e,f] for a in falseparamsvalues[0]
                                 for b in falseparamsvalues[1]
                                 for c in falseparamsvalues[2]
                                 for d in falseparamsvalues[3]
                                 for e in falseparamsvalues[4]
                                 for f in falseparamsvalues[5]]

spe = {}
for method in df.method.values.categories:
  for falseparamsset in falseparamssets:
    df2 = df[multiplecond(df,['method']+falseparams,[method]+falseparamsset)]
    if not df2.empty:
      n = method + '_'.join([''.join([str(x),str(y)]) for x,y in
                             zip(falseparams,falseparamsset)])
      spe[n]=df2.copy()

df2 = []
for d in spe:
  # get best score (=best thresh) for method
  g = spe[d].groupby('img')
  g = g['score'].max()
  df2 += [[d,g.mean()]]
  # visually verify thresh range

df2 = pd.DataFrame(df2,columns=['method','maxscore'])
df2.sort_values(by='score',ascending=False,inplace=True)

What is the proper way to do that ?

EDIT: here's a pastable version of the 200 first entries of my dataframe http://pastebin.com/r5uAiyHU

EDIT2: thanks to firelynx's answer, I managed to do something a little less dirty. Here's how it goes:

gbyimgbymet=df.groupby(['img','method','minz','zst','minshape'])
idx = [list(a[1][a[1]['score']==a[1].score.max()].index) for a in gbyimgbymet]
a=[]
for i in idx:
  a+=i

bestscoresbyimgbymet = df.loc[a]

I know there's probably slightly better outthere, especially to join the indexes. The reason why firelynx's answer doesn't really fit is that I need all the rows with max value, and not only one, which is what argmax returns. For this reason, I probably cannot use the agg method as well. Correct me if I'm wrong.

Upvotes: 0

Views: 146

Answers (1)

firelynx
firelynx

Reputation: 32224

Ok, your code is very dirty and I think you have over-engineered your solution, so I will simply give you an example of how I would do this conceptually, using cleaner example code.

My example dataframe:

   a  b  c  othervalue
0  1  a  z         100
1  1  b  x         101
2  1  c  y         102
3  2  d  v         103
4  2  e  u         104
5  2  f  t         105

Using the argmax, we can get the index of the value which is highest in the group.

df.groupby('a').agg({'othervalue':pd.Series.argmax})
   othervalue
a            
1           2
2           5

Now we can use that value inside the .loc method to get the whole rows from the original dataframe.

max_scores = df.groupby('a').agg({'othervalue':pd.Series.argmax})['othervalue']
df.loc[max_scores]
   a  b  c  othervalue
2  1  c  y         102
5  2  f  t         105

Multiple rows with maximum value (Question extension)

If you have multiple rows matching the maximum value, you will have to do something a bit different, and one more step.

    a  b  c  othervalue
0   1  a  z         100
1   1  b  x         101
2   1  c  y         102
3   2  d  v         103
4   2  e  u         104
5   2  f  t         105
6   1  a  z         100
7   1  b  x         101
8   1  c  y         102
9   2  d  v         103
10  2  e  u         104
11  2  f  t         105

With the above example, first we get the maximum values in each group, and reset the index so we can use it for the coming merge.

maxvalues_per_group = df.groupby('a').agg({'othervalue':pd.np.max})
maxvalues_per_group.reset_index(inplace=True)

With these values, we merge on the original dataframe again to get all rows that matches the maximum values in each group.

df.merge(on=['a', 'othervalue'], right=maxvalues_per_group, how='inner')

   a  b  c  othervalue
0  1  c  y         102
1  1  c  y         102
2  2  f  t         105
3  2  f  t         105

Upvotes: 2

Related Questions