Eduardo Reis

Reputation: 1971

How to select the entry with the lowest mean per group

Suppose I have the following dataframe

df = pd.DataFrame([
    (2, 2, 'A', .5),
    (2, 2, 'A', .6),
    (2, 2, 'B', .75),
    (2, 2, 'B', .7),
    (2, 2, 'C', .6),

    (2, 3, 'A', .65),
    (2, 3, 'A', .6),
    (2, 3, 'B', .75),
    (2, 3, 'B', .7),
    (2, 3, 'C', .6)
], columns=['out_size', 'problem_size', 'algo', 'time'])

I want to select, for each (out_size, problem_size) group, the algo with the lowest mean time.

Expected result:

pd.DataFrame(
      [[2, 2, 'A', 0.55],
       [2, 3, 'C', 0.6]], columns=['out_size', 'problem_size', 'algo', 'time'])

Upvotes: 1

Views: 78

Answers (6)

G.G

Reputation: 765

# For each (out_size, problem_size) pair, keep the algo with the smallest mean time
(df.groupby(['out_size', 'problem_size'])
   .apply(lambda dd: dd.groupby('algo')['time'].mean().nsmallest(1))
   .reset_index())

output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C  0.60

Upvotes: 0

BeRT2me

Reputation: 13251

cols = ['out_size', 'problem_size', 'algo']
# Mean per (out_size, problem_size, algo), pivot algo into columns, then take
# the column label (idxmin) and value (min) of the row-wise minimum.
out = df.groupby(cols)['time'].mean().unstack('algo').agg(['idxmin', 'min'], axis=1).reset_index()
out.rename(columns={'idxmin': 'algo', 'min': 'time'}, inplace=True)
print(out)

Output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C   0.6

Upvotes: 1

Lekshman Natarajan

Reputation: 71

You can do this in three steps: group to get the mean per algo, sort by time, then group again and take the first row of each group:

df1 = df.groupby(['out_size', 'problem_size', 'algo'])['time'].mean().reset_index()
df1 = df1.sort_values(['out_size', 'problem_size', 'time'])
df1 = df1.groupby(['out_size', 'problem_size']).head(1)
df1

Output:

   out_size  problem_size algo  time
          2             2    A  0.55
          2             3    C  0.60

Upvotes: 1

SomeDude

Reputation: 14238

You can do this by grouping by the three columns, aggregating time with mean, and sorting ascending. Since you only want the first row of each group, drop duplicates on out_size and problem_size keeping the first occurrence; because the frame is already sorted by time, that first row is guaranteed to be the algo with the lowest average.

(df.groupby(['out_size', 'problem_size', 'algo'], as_index=False)
   .agg({'time': 'mean'})
   .sort_values(by='time', ignore_index=True)
   .drop_duplicates(subset=['out_size', 'problem_size'], keep='first', ignore_index=True))

Or, if you have more columns that all need to be averaged:

agg_cols = ['time']  # add any other columns to average here
groupby_cols = df.columns.drop(agg_cols).tolist()
g = df.groupby(groupby_cols, as_index=False).agg('mean')
g.sort_values(by='time', ignore_index=True).drop_duplicates(
    subset=['out_size', 'problem_size'], keep='first', ignore_index=True)

output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C  0.60

Upvotes: 1

mattiatantardini

Reputation: 793

You can do this with two groupbys:

group_cols = ["out_size", "problem_size", "algo"]
result = df.groupby(group_cols)["time"].mean().reset_index(drop=False)

This computes the mean time for each group. Then sort by time in ascending order:

result = result.sort_values("time", ascending=True)

and finally group again (this time without the algo column) and take the first row of each group:

result = result.groupby(["out_size", "problem_size"])[["algo","time"]].first().reset_index(drop=False)
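
Put together, the same three steps can be chained into a single expression (a compact rewrite of the code above, same method):

group_cols = ["out_size", "problem_size", "algo"]
result = (
    df.groupby(group_cols)["time"].mean()   # mean time per (out_size, problem_size, algo)
      .reset_index()
      .sort_values("time")                  # lowest mean first
      .groupby(["out_size", "problem_size"])[["algo", "time"]]
      .first()                              # first row of each group = lowest mean
      .reset_index()
)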

Upvotes: 1

mozway

Reputation: 262549

You can use a double groupby:

cols = ['out_size', 'problem_size', 'algo']

out = (df
 .groupby(cols, as_index=False)['time'].mean()
 .sort_values(by='time')
 .groupby(cols[:-1], as_index=False).first()
)

A slightly more efficient alternative that doesn't require sorting the values (but does require storing an intermediate):

cols = ['out_size', 'problem_size', 'algo']

out = df.groupby(cols)['time'].mean()
out = out.loc[out.groupby(cols[:-1]).idxmin()].reset_index()

output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C  0.60
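
For reference, in the second variant the inner groupby(cols[:-1]).idxmin() returns, for each (out_size, problem_size) pair, the full index label of the row with the smallest mean, and .loc then selects exactly those rows (means below is just a renamed copy of the intermediate Series the answer first assigns to out):

means = df.groupby(cols)['time'].mean()
means.groupby(cols[:-1]).idxmin()   # (2, 2, 'A') and (2, 3, 'C') for the example data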

Upvotes: 2
