Nils Gudat

Reputation: 13800

Speed up dataframe loop

I'm running two different but very similar loops over a pandas dataframe and am wondering whether there is some sort of groupby operation that would allow me to speed this up by avoiding a loop.

for x in df.var1.unique():
    df.loc[df.var1 == x, 'var2'] = df.loc[df.var1 == x, 'var2'].max()

That is, given that there are multiple rows with the same value of var1, I want to set the value of var2 for all of these rows to the maximum that var2 obtains over all these rows.

I feel like I should be able to do this without a for loop, but for some reason I can't figure out how. Ideas?

Upvotes: 3

Views: 307

Answers (1)

Zachary Cross

Reputation: 2318

It looks like you want to replace each value in one column with the maximum of that column within its group, where the groups are defined by the values in another column. You should be able to use groupby() and transform() with max to get what you want:

>>> import pandas as pd
>>> df = pd.DataFrame({"var1": [1, 1, 2, 2, 3, 3], 'var2': [1, 2, 3, 4, 5, 6]})
>>> df
   var1  var2
0     1     1
1     1     2
2     2     3
3     2     4
4     3     5
5     3     6
>>> df['var2'] = df.groupby('var1')['var2'].transform('max')
>>> df
   var1  var2
0     1     2
1     1     2
2     2     4
3     2     4
4     3     6
5     3     6
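
If the real dataframe has more columns than var1 and var2, it may help to select var2 explicitly before calling transform, so that only that column is aggregated. The following is a minimal sketch under that assumption; the extra column var3 is hypothetical and only there to show that other columns are left alone. It also checks the result against the loop from the question.

```python
import pandas as pd

# var3 is a hypothetical extra column, not part of the original question.
df = pd.DataFrame({
    "var1": [1, 1, 2, 2, 3, 3],
    "var2": [1, 2, 3, 4, 5, 6],
    "var3": list("abcdef"),
})

# Loop version from the question, kept only as a reference result.
expected = df["var2"].copy()
for x in df["var1"].unique():
    mask = df["var1"] == x
    expected[mask] = df.loc[mask, "var2"].max()

# Vectorized version: per-group maximum of var2, broadcast back to every row.
df["var2"] = df.groupby("var1")["var2"].transform("max")

result = df["var2"].tolist()
print(result)
```

Because transform returns a result aligned with the original index, the assignment works without any merge or reindexing step.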

Upvotes: 5
