Reputation: 13800
I'm running two different but very similar loops over a pandas dataframe and am wondering whether there is some sort of groupby operation that would allow me to speed this up by avoiding a loop.
for x in df.var1:
    df.loc[df.var1 == x, 'var2'] = np.max(df.loc[df.var1 == x, 'var2'])
That is, given that there are multiple rows with the same value of var1, I want to set the value of var2 for all of these rows to the maximum that var2 attains over all these rows.
I feel like I should be able to do this without a for loop, but for some reason I can't figure out how. Ideas?
Upvotes: 3
Views: 307
Reputation: 2318
It looks like you want to replace each value in a column with the maximum of that column within its group, where the groups are defined by the values in another column. You should be able to use groupby() and transform('max') to get what you want:
>>> import pandas as pd
>>> df = pd.DataFrame({"var1": [1, 1, 2, 2, 3, 3], 'var2': [1, 2, 3, 4, 5, 6]})
>>> df
   var1  var2
0     1     1
1     1     2
2     2     3
3     2     4
4     3     5
5     3     6
>>> df['var2'] = df.groupby('var1')['var2'].transform('max')
>>> df
   var1  var2
0     1     2
1     1     2
2     2     4
3     2     4
4     3     6
5     3     6
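If you want to convince yourself that this matches the loop from the question, here is a minimal sketch that runs both on the same toy data. The names looped and vectorized are just illustrative, and the loop uses .loc with a boolean mask and iterates over the unique values of var1, which is my assumption about what the original loop was meant to do:

```python
import pandas as pd

df = pd.DataFrame({"var1": [1, 1, 2, 2, 3, 3], "var2": [1, 2, 3, 4, 5, 6]})

# Loop version: one pass per distinct value of var1.
looped = df.copy()
for x in looped["var1"].unique():
    mask = looped["var1"] == x
    looped.loc[mask, "var2"] = looped.loc[mask, "var2"].max()

# Vectorized version: transform returns a result aligned with the
# original index, so it can be assigned straight back to the column.
vectorized = df.copy()
vectorized["var2"] = vectorized.groupby("var1")["var2"].transform("max")

print(looped["var2"].tolist())
print(vectorized["var2"].tolist())
```

Selecting ['var2'] before calling transform keeps the result a Series, which matters once the frame has more columns than the two shown here.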
Upvotes: 5