Speed up dataframe loop

Question

I'm running two different but very similar loops over a pandas dataframe and am wondering whether there is some sort of groupby operation that would allow me to speed this up by avoiding a loop.

for x in df.var1:
    df.loc[df.var1 == x, 'var2'] = np.max(df.loc[df.var1 == x, 'var2'])

That is, given that there are multiple rows with the same value of var1, I want to set the value of var2 for all of these rows to the maximum that var2 obtains over all these rows.

I feel like I should be able to do this without a for loop, but for some reason I can't figure out how. Ideas?

Zachary Cross · Accepted Answer

It looks like you want to replace every value in one column with the maximum of that column within its group, where the groups are defined by the values in another column. You should be able to use groupby() together with transform() and a max aggregation to get what you want:

>>> import pandas as pd
>>> df = pd.DataFrame({"var1": [1, 1, 2, 2, 3, 3], 'var2': [1, 2, 3, 4, 5, 6]})
>>> df
   var1  var2
0     1     1
1     1     2
2     2     3
3     2     4
4     3     5
5     3     6
>>> df['var2'] = df.groupby('var1')['var2'].transform('max')
>>> df
   var1  var2
0     1     2
1     1     2
2     2     4
3     2     4
4     3     6
5     3     6
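
To check that the groupby/transform approach really matches the original loop, a minimal sketch along these lines may help. The helper names groupwise_max_loop and groupwise_max_transform and the random sample data are assumptions made for illustration; they are not part of the question. The loop version iterates over unique values of var1 rather than every row, which should produce the same result with fewer passes.

```python
import numpy as np
import pandas as pd


def groupwise_max_loop(df):
    """Loop-based version: one boolean mask per distinct var1 value."""
    out = df.copy()
    for x in out["var1"].unique():
        mask = out["var1"] == x
        out.loc[mask, "var2"] = out.loc[mask, "var2"].max()
    return out


def groupwise_max_transform(df):
    """Vectorized version: broadcast each group's max back to its rows."""
    out = df.copy()
    out["var2"] = out.groupby("var1")["var2"].transform("max")
    return out


# Hypothetical sample data: 1,000 rows spread over at most 50 groups.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "var1": rng.integers(0, 50, size=1_000),
    "var2": rng.integers(0, 10_000, size=1_000),
})

looped = groupwise_max_loop(sample)
vectorized = groupwise_max_transform(sample)

# Both approaches should agree row for row.
same = bool((looped["var2"].to_numpy() == vectorized["var2"].to_numpy()).all())
print("results match:", same)
```

On larger frames with many distinct var1 values, the transform version will likely be much faster, since it avoids building a boolean mask for each group; timing both on your own data is the most reliable way to confirm.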
