Eduardo Reis

Reputation: 1971

How to select the entry with the lowest mean per group

Suppose I have the following dataframe

df = pd.DataFrame([
    (2, 2, 'A', .5),
    (2, 2, 'A', .6),
    (2, 2, 'B', .75),
    (2, 2, 'B', .7),
    (2, 2, 'C', .6),

    (2, 3, 'A', .65),
    (2, 3, 'A', .6),
    (2, 3, 'B', .75),
    (2, 3, 'B', .7),
    (2, 3, 'C', .6)
], columns=['out_size', 'problem_size', 'algo', 'time'])

I want to select, for each (out_size, problem_size) group, the algo with the lowest mean time.

Expected result:

pd.DataFrame(
      [[2, 2, 'A', 0.55],
       [2, 3, 'C', 0.6]], columns=['out_size', 'problem_size', 'algo', 'time'])

Upvotes: 1

Views: 78

Answers (6)

G.G

Reputation: 765

# For each (out_size, problem_size) pair, keep the algo with the smallest mean time
(df.groupby(['out_size', 'problem_size'])
   .apply(lambda dd: dd.groupby('algo')['time'].mean().nsmallest(1))
   .reset_index())

output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C  0.60

Upvotes: 0

BeRT2me

Reputation: 13251

cols = ['out_size', 'problem_size', 'algo']
# Mean per (out_size, problem_size, algo), pivot algo into columns, then take
# the column label (idxmin) and value (min) of the row-wise minimum.
out = df.groupby(cols)['time'].mean().unstack('algo').agg(['idxmin', 'min'], axis=1).reset_index()
out.rename(columns={'idxmin': 'algo', 'min': 'time'}, inplace=True)
print(out)

Output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C   0.6

Upvotes: 1

Lekshman Natarajan

Reputation: 71

You can do this in three steps: group to get the mean per algo, sort by time, then group again and take the first row of each group:

df1 = df.groupby(['out_size', 'problem_size', 'algo'])['time'].mean().reset_index()
df1 = df1.sort_values(['out_size', 'problem_size', 'time'])
df1 = df1.groupby(['out_size', 'problem_size']).head(1)
df1

Output:

   out_size  problem_size algo  time
          2             2    A  0.55
          2             3    C  0.60

Upvotes: 1

SomeDude

Reputation: 14238

You can do this by grouping by the three columns, aggregating time with mean, and sorting ascending. Since you only want the first row of each group, drop duplicates on out_size and problem_size keeping the first occurrence; because the frame is already sorted by time, that first row is guaranteed to be the algo with the lowest average.

(df.groupby(['out_size', 'problem_size', 'algo'], as_index=False)
   .agg({'time': 'mean'})
   .sort_values(by='time', ignore_index=True)
   .drop_duplicates(subset=['out_size', 'problem_size'], keep='first', ignore_index=True))

Or, if you have more columns that all need to be averaged:

agg_cols = ['time']  # add any other columns to average here
groupby_cols = df.columns.drop(agg_cols).tolist()
g = df.groupby(groupby_cols, as_index=False).agg('mean')
g.sort_values(by='time', ignore_index=True).drop_duplicates(
    subset=['out_size', 'problem_size'], keep='first', ignore_index=True)

output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C  0.60

Upvotes: 1

mattiatantardini

Reputation: 793

You can do this with two groupbys:

group_cols = ["out_size", "problem_size", "algo"]
result = df.groupby(group_cols)["time"].mean().reset_index(drop=False)

This computes the mean time for each group. Then sort by time in ascending order:

result = result.sort_values("time", ascending=True)

and finally group again (this time without the algo column) and take the first row of each group:

result = result.groupby(["out_size", "problem_size"])[["algo","time"]].first().reset_index(drop=False)
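
Put together, the same three steps can be chained into a single expression (a compact rewrite of the code above, same method):

group_cols = ["out_size", "problem_size", "algo"]
result = (
    df.groupby(group_cols)["time"].mean()   # mean time per (out_size, problem_size, algo)
      .reset_index()
      .sort_values("time")                  # lowest mean first
      .groupby(["out_size", "problem_size"])[["algo", "time"]]
      .first()                              # first row of each group = lowest mean
      .reset_index()
)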

Upvotes: 1

mozway

Reputation: 262549

You can use a double groupby:

cols = ['out_size', 'problem_size', 'algo']

out = (df
 .groupby(cols, as_index=False)['time'].mean()
 .sort_values(by='time')
 .groupby(cols[:-1], as_index=False).first()
)

A slightly more efficient alternative that doesn't require sorting the values (but does require storing an intermediate):

cols = ['out_size', 'problem_size', 'algo']

out = df.groupby(cols)['time'].mean()
out = out.loc[out.groupby(cols[:-1]).idxmin()].reset_index()

output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C  0.60
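
For reference, in the second variant the inner groupby(cols[:-1]).idxmin() returns, for each (out_size, problem_size) pair, the full index label of the row with the smallest mean, and .loc then selects exactly those rows (means below is just a renamed copy of the intermediate Series the answer first assigns to out):

means = df.groupby(cols)['time'].mean()
means.groupby(cols[:-1]).idxmin()   # (2, 2, 'A') and (2, 3, 'C') for the example data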

Upvotes: 2
