Matt
Matt

Reputation: 813

Create DataFrame column from a groupby

Suppose I have a DataFrame like

import pandas as pd
df = pd.DataFrame({
    'Id'    : [1,2,3,4,5,6,7,8,9],
    'Group' : [1,1,2,2,2,2,3,3,3],
    'Value_to_compare' : [2,1,5,8,2,3,10,23,17],
    'Other_value' : [0,3,2,6,3,4,2,7,1]
})

I would like to create a new column, say Value_of_Highest, displaying for each row Other_value of element having the highest Value_to_compare of its Group. For example, here:

So I would like to add a column so that df becomes

This is the best way I know to do this:

def my_func(x):
    x = x.sort_values('Value_to_compare',ascending = False)
    Value_of_Highest = x.head(1)['Other_value'].values[0]
    return pd.Series([Value_of_Highest], index=['Value_of_Highest']) 

grouped = df.groupby('Group').apply(my_func).reset_index()

df = df.merge(grouped)

I am pretty sure there is a far more elegant and efficient way to do this in Python/Pandas.

Edit: after first answer from @CameronRiddell, I realized my example was flawed. I corrected it and @CameronRiddell edited his answer, which works well.

Upvotes: 0

Views: 189

Answers (1)

Cameron Riddell
Cameron Riddell

Reputation: 13407

This is a great application of groupby and transform essentially, using transform from a groupby object applies a function to each group, but then returns a Series or DataFrame of the same size as that group. This results in DataFrame/Series whose shape is the same as the original shape along the groupby axis. (e.g. in your case, the result of a groupby/transform will have the same number of rows as your original dataframe).

df["Value_of_Highest"] = df.groupby("Group")["Value_to_compare"].transform("max")

print(df)

   Id  Group  Value_to_compare  Other_value  Value_of_Highest
0   1      1                 2            0                 2
1   2      1                 1            3                 2
2   3      2                 5            2                 8
3   4      2                 8            6                 8
4   5      2                 2            3                 8
5   6      2                 3            4                 8
6   7      3                10            2                23
7   8      3                23            7                23
8   9      3                17            1                23
  • df.groupby("Group"): group the dataframe by our "Group" column
  • ["Value_to_compare"]: From each of those groups, select the "Value_to_compare" column
  • .transform("max"): Get the maximum value of our selected column for each group. Then return a Series who has the same number of rows as the group

Edit: To obtain a value based on the location of a maximum, we use idxmax(). This returns the position where a max/min occurs. So for our use case:

  • we'll groupby "Group" and select the "Value_to_compare" column
  • Then we get the index of the maximum values of "Value_to_compare"
highest_vtc_indices = df.groupby("Group")["Value_to_compare"].idxmax()

print(highest_vtc_indices)
Group
1    0
2    3
3    7
Name: Value_to_compare, dtype: int64

0, 3, 7 corresponds to the row ID where the maximum values of "Value_to_compare" occur in each group.

Now that we know the row these maximums occurred in:

  • We'll need to subset our original dataframe to get the "Other_value" in each of these rows
  • We'll also need to get the "Group" column associated with these rows, so that we can align these values properly later
# Obtain "Group" & "Other_value" at rows 0, 3, 7
  # which were the rows for each highest Value_to_compare per group
highest_other_values = df.loc[highest_vtc_indices, ["Group", "Other_value"]]

# Rename Other_value to Value_of_Highest
highest_other_values = highest_other_values.rename(columns={"Other_value": "Value_of_Highest"})

print(highest_other_values)
   Group  Value_of_Highest
0      1                 0
3      2                 6
7      3                 7

Now that we have the corresponding "Other_value" to the maximum "Value_to_compare" per group, we'll use a merge to align our new smaller dataframe back to our original dataframe. This will broadcast "Value_of_Highest" across the "Group" column from the original dataframe.

# Merge this new dataframe back to our old one to broadcast "Value_of_Highest" across each group
final = df.merge(highest_other_values, on="Group")

print(final)
   Id  Group  Value_to_compare  Other_value  Value_of_Highest
0   1      1                 2            0                 0
1   2      1                 1            3                 0
2   3      2                 5            2                 6
3   4      2                 8            6                 6
4   5      2                 2            3                 6
5   6      2                 3            4                 6
6   7      3                10            2                 7
7   8      3                23            7                 7
8   9      3                17            1                 7

Upvotes: 2

Related Questions