Reputation: 813
Suppose I have a DataFrame like
import pandas as pd
df = pd.DataFrame({
'Id' : [1,2,3,4,5,6,7,8,9],
'Group' : [1,1,2,2,2,2,3,3,3],
'Value_to_compare' : [2,1,5,8,2,3,10,23,17],
'Other_value' : [0,3,2,6,3,4,2,7,1]
})
I would like to create a new column, say Value_of_Highest
, displaying for each row Other_value
of element having the highest Value_to_compare
of its Group
. For example, here:
Value_to_compare
is 2, for Id
= 1, for which Other_value
is 0Value_to_compare
is 8, for Id
= 4, for which Other_value
is 6Value_to_compare
is 23, for Id
= 8, for which Other_value
is 7So I would like to add a column so that df becomes
This is the best way I know to do this:
def my_func(x):
x = x.sort_values('Value_to_compare',ascending = False)
Value_of_Highest = x.head(1)['Other_value'].values[0]
return pd.Series([Value_of_Highest], index=['Value_of_Highest'])
grouped = df.groupby('Group').apply(my_func).reset_index()
df = df.merge(grouped)
I am pretty sure there is a far more elegant and efficient way to do this in Python/Pandas.
Edit: after first answer from @CameronRiddell, I realized my example was flawed. I corrected it and @CameronRiddell edited his answer, which works well.
Upvotes: 0
Views: 189
Reputation: 13407
This is a great application of groupby
and transform
essentially, using transform from a groupby object applies a function to each group, but then returns a Series or DataFrame of the same size as that group. This results in DataFrame/Series whose shape is the same as the original shape along the groupby axis. (e.g. in your case, the result of a groupby/transform will have the same number of rows as your original dataframe).
df["Value_of_Highest"] = df.groupby("Group")["Value_to_compare"].transform("max")
print(df)
Id Group Value_to_compare Other_value Value_of_Highest
0 1 1 2 0 2
1 2 1 1 3 2
2 3 2 5 2 8
3 4 2 8 6 8
4 5 2 2 3 8
5 6 2 3 4 8
6 7 3 10 2 23
7 8 3 23 7 23
8 9 3 17 1 23
df.groupby("Group")
: group the dataframe by our "Group" column["Value_to_compare"]
: From each of those groups, select the "Value_to_compare" column.transform("max")
: Get the maximum value of our selected column for each group. Then return a Series who has the same number of rows as the groupEdit: To obtain a value based on the location of a maximum, we use idxmax()
. This returns the position where a max/min occurs. So for our use case:
highest_vtc_indices = df.groupby("Group")["Value_to_compare"].idxmax()
print(highest_vtc_indices)
Group
1 0
2 3
3 7
Name: Value_to_compare, dtype: int64
0, 3, 7 corresponds to the row ID where the maximum values of "Value_to_compare" occur in each group.
Now that we know the row these maximums occurred in:
# Obtain "Group" & "Other_value" at rows 0, 3, 7
# which were the rows for each highest Value_to_compare per group
highest_other_values = df.loc[highest_vtc_indices, ["Group", "Other_value"]]
# Rename Other_value to Value_of_Highest
highest_other_values = highest_other_values.rename(columns={"Other_value": "Value_of_Highest"})
print(highest_other_values)
Group Value_of_Highest
0 1 0
3 2 6
7 3 7
Now that we have the corresponding "Other_value" to the maximum "Value_to_compare" per group, we'll use a merge
to align our new smaller dataframe back to our original dataframe. This will broadcast "Value_of_Highest" across the "Group" column from the original dataframe.
# Merge this new dataframe back to our old one to broadcast "Value_of_Highest" across each group
final = df.merge(highest_other_values, on="Group")
print(final)
Id Group Value_to_compare Other_value Value_of_Highest
0 1 1 2 0 0
1 2 1 1 3 0
2 3 2 5 2 6
3 4 2 8 6 6
4 5 2 2 3 6
5 6 2 3 4 6
6 7 3 10 2 7
7 8 3 23 7 7
8 9 3 17 1 7
Upvotes: 2