Reputation: 1563
I have this dataframe which entries are:
In [77]: df.loc[1]
Out[77]:
img 410T1_B03_S06_W2_cell1_ann.tif
immean 1302
imvar 101773
imrange 2640
imtotalvar 63321
method maxminratio
thresh 1.01
cont 8
minz 2
cent 50
zs 1
localmax F
sha 1383
tp 3
fp 44
fn 0
time 139.4
precision 0.0638298
sensitivity 1
score 0.12
Name: 1, dtype: object
with dtypes:
In [79]: df.dtypes
Out[79]:
img category
immean int64
imvar int64
imrange int64
imtotalvar int64
method category
thresh float64
cont category
minz category
cent category
zs category
localmax category
sha category
tp float64
fp float64
fn float64
time float64
precision float64
sensitivity float64
score float64
dtype: object
I need to group them by 'method'
, then subgroup them by same sets of ['cont','minz','cent','zs','localmax','sha']
, and once this is done, get the row that has the best 'score'
(while preserving other infos).
I've did some really dirty code that works that does the job, but I'd like to get it done the right pandasic way:
def multiplecond(df,listvar,listvalues):
res = df[listvar[0]]==listvalues[0]
for var,val in zip(listvar[1:],listvalues[1:]):
res = (res) & (df[var]==val)
return res
falseparams = ['cont','minz','cent','zs','localmax','sha']
falseparamsvalues = [list(df[x].values.categories) for x in falseparams]
falseparamssets = [[a,b,c,d,e,f] for a in falseparamsvalues[0]
for b in falseparamsvalues[1]
for c in falseparamsvalues[2]
for d in falseparamsvalues[3]
for e in falseparamsvalues[4]
for f in falseparamsvalues[5]]
spe = {}
for method in df.method.values.categories:
for falseparamsset in falseparamssets:
df2 = df[multiplecond(df,['method']+falseparams,[method]+falseparamsset)]
if not df2.empty:
n = method + '_'.join([''.join([str(x),str(y)]) for x,y in
zip(falseparams,falseparamsset)])
spe[n]=df2.copy()
df2 = []
for d in spe:
# get best score (=best thresh) for method
g = spe[d].groupby('img')
g = g['score'].max()
df2 += [[d,g.mean()]]
# visually verify thresh range
df2 = pd.DataFrame(df2,columns=['method','maxscore'])
df2.sort_values(by='score',ascending=False,inplace=True)
What is the proper way to do that ?
EDIT: here's a pastable version of the 200 first entries of my dataframe http://pastebin.com/r5uAiyHU
EDIT2: thanks to firelynx's answer, I managed to do something a little less dirty. Here's how it goes:
gbyimgbymet=df.groupby(['img','method','minz','zst','minshape'])
idx = [list(a[1][a[1]['score']==a[1].score.max()].index) for a in gbyimgbymet]
a=[]
for i in idx:
a+=i
bestscoresbyimgbymet = df.loc[a]
I know there's probably slightly better outthere, especially to join the indexes. The reason why firelynx's answer doesn't really fit is that I need all the rows with max value, and not only one, which is what argmax
returns. For this reason, I probably cannot use the agg
method as well. Correct me if I'm wrong.
Upvotes: 0
Views: 146
Reputation: 32224
Ok, your code is very dirty and I think you have over-engineered your solution, so I will simply give you an example of how I would do this conceptually, using cleaner example code.
My example dataframe:
a b c othervalue
0 1 a z 100
1 1 b x 101
2 1 c y 102
3 2 d v 103
4 2 e u 104
5 2 f t 105
Using the argmax
, we can get the index of the value which is highest in the group.
df.groupby('a').agg({'othervalue':pd.Series.argmax})
othervalue
a
1 2
2 5
Now we can use that value inside the .loc
method to get the whole rows from the original dataframe.
max_scores = df.groupby('a').agg({'othervalue':pd.Series.argmax})['othervalue']
df.loc[max_scores]
a b c othervalue
2 1 c y 102
5 2 f t 105
If you have multiple rows matching the maximum value, you will have to do something a bit different, and one more step.
a b c othervalue
0 1 a z 100
1 1 b x 101
2 1 c y 102
3 2 d v 103
4 2 e u 104
5 2 f t 105
6 1 a z 100
7 1 b x 101
8 1 c y 102
9 2 d v 103
10 2 e u 104
11 2 f t 105
With the above example, first we get the maximum values in each group, and reset the index so we can use it for the coming merge.
maxvalues_per_group = df.groupby('a').agg({'othervalue':pd.np.max})
maxvalues_per_group.reset_index(inplace=True)
With these values, we merge on the original dataframe again to get all rows that matches the maximum values in each group.
df.merge(on=['a', 'othervalue'], right=maxvalues_per_group, how='inner')
a b c othervalue
0 1 c y 102
1 1 c y 102
2 2 f t 105
3 2 f t 105
Upvotes: 2