Situation I have a pandas dataframe defined as follows: import pandas as pd headers = ['Group', 'Element', 'Case', 'Score', 'Evaluation'] data = [ ['A', 1, 'x', 1.40, 0.59], ['A', 1, 'y', 9.19, 0.52], ['A', 2, 'x', 8.82, 0.80], ['A', 2, 'y', 7.18, 0.41], ['B', 1, 'x', 1.38, 0.22], ['B', 1, 'y', 7.14, 0.10], ['B', 2, 'x', 9.12, 0.28], ['B', 2, 'y', 4.11, 0.97], ] df = pd.DataFrame(data, columns=headers) which looks like this in console output: Group Element Case Score Evaluation 0 A 1 x 1.40 0.59 1 A 1 y 9.19 0.52 2 A 2 x 8.82 0.80 3 A 2 y 7.18 0.41 4 B 1 x 1.38 0.22 5 B 1 y 7.14 0.10 6 B 2 x 9.12 0.28 7 B 2 y 4.11 0.97 Problem I'd like to perform a grouping-and-aggregation operation on df that will give me the following result dataframe: Group Max_score_value Max_score_element Max_score_case Min_evaluation 0 A 9.19 1 y 0.41 1 B 9.12 2 x 0.10 To clarify in more detail: I'd like to group by the Group column, and then apply aggregation to get the following result columns: Max_score_value : the group-maximum value from the Score column. Max_score_element : the value from the Element column that corresponds to the group-maximum Score value. Max_score_case : the value from the Case column that corresponds to the group-maximum Score value. Min_evaluation : the group-minimum value from the Evaluation column. Tried thus far I've come up with the following code for the grouping-and-aggregation: result = ( df.set_index(['Element', 'Case']) .groupby('Group') .agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'}) .reset_index() ) print(result) which gives as output: Group Score Evaluation max idxmax min 0 A 9.19 (1, y) 0.41 1 B 9.12 (2, x) 0.10 As you can see the basic data is there, but it's not quite in the format yet that I need. It's this last step that I'm struggling with. Does anyone here have some good ideas for generating a result dataframe in the format that I'm looking for?

Reputation: 8634

Pandas: grouping and aggregation with multiple functions

Situation

I have a pandas dataframe defined as follows:

import pandas as pd

headers = ['Group', 'Element', 'Case', 'Score', 'Evaluation']
data = [
    ['A', 1, 'x', 1.40, 0.59],
    ['A', 1, 'y', 9.19, 0.52],
    ['A', 2, 'x', 8.82, 0.80],
    ['A', 2, 'y', 7.18, 0.41],
    ['B', 1, 'x', 1.38, 0.22],
    ['B', 1, 'y', 7.14, 0.10],
    ['B', 2, 'x', 9.12, 0.28],
    ['B', 2, 'y', 4.11, 0.97],
]
df = pd.DataFrame(data, columns=headers)

which looks like this in console output:

  Group  Element Case  Score  Evaluation
0     A        1    x   1.40        0.59
1     A        1    y   9.19        0.52
2     A        2    x   8.82        0.80
3     A        2    y   7.18        0.41
4     B        1    x   1.38        0.22
5     B        1    y   7.14        0.10
6     B        2    x   9.12        0.28
7     B        2    y   4.11        0.97

Problem

I'd like to perform a grouping-and-aggregation operation on df that will give me the following result dataframe:

  Group  Max_score_value  Max_score_element  Max_score_case  Min_evaluation
0     A             9.19                  1               y            0.41 
1     B             9.12                  2               x            0.10

To clarify in more detail: I'd like to group by the Group column, and then apply aggregation to get the following result columns:

Max_score_value: the group-maximum value from the Score column.
Max_score_element: the value from the Element column that corresponds to the group-maximum Score value.
Max_score_case: the value from the Case column that corresponds to the group-maximum Score value.
Min_evaluation: the group-minimum value from the Evaluation column.

Tried thus far

I've come up with the following code for the grouping-and-aggregation:

result = (
    df.set_index(['Element', 'Case'])
    .groupby('Group')
    .agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'})
    .reset_index()
)
print(result)

which gives as output:

  Group Score         Evaluation
          max  idxmax        min
0     A  9.19  (1, y)       0.41
1     B  9.12  (2, x)       0.10

As you can see the basic data is there, but it's not quite in the format yet that I need. It's this last step that I'm struggling with. Does anyone here have some good ideas for generating a result dataframe in the format that I'm looking for?

Upvotes: 4

Answers (4)

piRSquared

Reputation: 294338

My Take

g = df.set_index('Group').groupby(level='Group', group_keys=False)

result = g.apply(
    pd.DataFrame.nlargest, n=1, columns='Score'
)

def f(x):
    x = 'value' if x == 'Score' else x
    return 'Max_score_' + x.lower()

result.drop('Evaluation', 1).rename(columns=f).assign(
    Min_evaluation=g.Evaluation.min().values).reset_index()

  Group  Max_score_element Max_score_case  Max_score_value  Min_evaluation
0     A                  1              y             9.19            0.41
1     B                  2              x             9.12            0.10

Upvotes: 1

Allen Qin

Reputation: 19947

You can use apply instead of agg to construct all the columns in one go.

result = (
    df.groupby('Group').apply(lambda x: [np.max(x.Score),
                              df.loc[x.Score.idxmax(),'Element'],
                              df.loc[x.Score.idxmax(),'Case'],
                              np.min(x.Evaluation)])
      .apply(pd.Series)
      .rename(columns={0:'Max_score_value',
                       1:'Max_score_element',
                       2:'Max_score_case',
                       3:'Min_evaluation'})
      .reset_index()
)



result
Out[9]: 
  Group  Max_score_value  Max_score_element Max_score_case  Min_evaluation
0     A             9.19                  1              y            0.41
1     B             9.12                  2              x            0.10

Upvotes: 3

akuiper

Reputation: 214977

Starting from the result data frame, you can transform in two steps as follows to the format you need:

# collapse multi index column to single level column
result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]

# split the idxmax column into two columns
result = result.assign(
    max_score_element = result.idxmax_Score.str[0],
    max_score_case = result.idxmax_Score.str[1]
).drop('idxmax_Score', 1)

result

#Group  max_Score   min_Evaluation  max_score_case  max_score_element
#0   A       9.19             0.41               y                  1
#1   B       9.12             0.10               x                  2

An alternative starting from original df using join, which may not be as efficient but less verbose similar to @tarashypka's idea:

(df.groupby('Group')
   .agg({'Score': 'idxmax', 'Evaluation': 'min'})
   .set_index('Score')
   .join(df.drop('Evaluation',1))
   .reset_index(drop=True))

#Evaluation  Group  Element   Case  Score
#0     0.41      A        1      y   9.19
#1     0.10      B        2      x   9.12

Naive timing with the example data set:

%%timeit 
(df.groupby('Group')
 .agg({'Score': 'idxmax', 'Evaluation': 'min'})
 .set_index('Score')
 .join(df.drop('Evaluation',1))
 .reset_index(drop=True))
# 100 loops, best of 3: 3.47 ms per loop

%%timeit
result = (
    df.set_index(['Element', 'Case'])
    .groupby('Group')
    .agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'})
    .reset_index()
)

result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]

result = result.assign(
    max_score_element = result.idxmax_Score.str[0],
    max_score_case = result.idxmax_Score.str[1]
).drop('idxmax_Score', 1)
# 100 loops, best of 3: 7.61 ms per loop

Upvotes: 4

Mr Tarsa

Reputation: 6652

Here is possible solution with pd.merge

>> r = df.groupby('Group') \
>>       .agg({'Score': 'idxmax', 'Evaluation': 'min'}) \
>>       .rename(columns={'Score': 'idx'})
>> for c in ['Score', 'Element', 'Case']:
>>   r = pd.merge(r, df[[c]], how='left', left_on='idx', right_index=True)
>> r.drop('Score_idx', axis=1).rename(columns={'Score': 'Max_score_value', 
>>                                             'Element': 'Max_score_element', 
>>                                             'Case': 'Max_score_case'})
       Evaluation  Max_score_value  Max_score_element Max_score_case
Group                                                               
A            0.41             9.19                  1              y
B            0.10             9.12                  2              x

Though it provides the desired output, I am not sure about if it's not less efficient than yours approach.

Upvotes: 2

Pandas: grouping and aggregation with multiple functions

Situation

Problem

Tried thus far

Answers (4)

Related Questions