Reputation: 23
I have some code right now that works fine, but it is entirely too slow. I'm trying to add up the weighted sum of squares for every row in a Pandas DataFrame. I'd like to vectorize the operations, since that seems to run much, much faster, but there's a wrinkle in the code that has defeated my attempts to vectorize.
totalDist = 0.0
# Accumulate each row's student-weighted squared distance to its chosen school.
for index, row in pU.iterrows():
    totalDist += row['distance'][row['schoolChoice']] ** 2.0 * float(row['students'])
Each row has 'students' (an integer), 'distance' (a numpy array of length n), and 'schoolChoice' (an integer less than or equal to n-1, which designates which element of the distance array I'm using for the calculation). Basically, I'm pulling a row-specific value from the numpy array. I've used df.lookup, but that actually seems to be slower and is being deprecated. Any suggestions on how to make this run faster? Thanks in advance!
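For reference, a tiny frame with the same shape can be built like this (the values are made up for illustration, not my real data):

import numpy as np
import pandas as pd

# Illustrative frame matching the structure above: a per-row numpy array
# (possibly of different lengths), plus an index into that array.
pU = pd.DataFrame({
    'distance': [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])],
    'schoolChoice': [0, 1],
    'students': [4, 5],
})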
Upvotes: 2
Views: 69
If all else fails, you can use .apply() on each row:
totalSum = df.apply(lambda row: row.distance[row.schoolChoice] ** 2 * row.students, axis=1).sum()
To go faster, you can import numpy and index all rows at once:
totalSum = (numpy.stack(df.distance)[range(len(df.schoolChoice)), df.schoolChoice] ** 2 * df.students).sum()
The numpy method requires that distance be the same length for each row; however, it is possible to pad the arrays to a common length if needed (though this may affect any gains made). For example:
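A minimal padding sketch, assuming zero-padding is acceptable (only the schoolChoice element of each row's array is ever read, so the pad value never enters the sum):

import numpy

# Pad each ragged 'distance' array with zeros up to the longest length.
max_len = df.distance.map(len).max()
padded = numpy.zeros((len(df), max_len))
for i, arr in enumerate(df.distance):
    padded[i, :len(arr)] = arr

totalSum = (padded[numpy.arange(len(df)), df.schoolChoice] ** 2 * df.students).sum()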
Tested on a df of 150,000 rows like:
distance schoolChoice students
0 [1, 2, 3] 0 4
1 [4, 5, 6] 2 5
2 [7, 8, 9] 2 6
3 [1, 2, 3] 0 4
4 [4, 5, 6] 2 5
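A frame of that shape can be built along these lines (a sketch; the 50,000-fold repetition of the three sample rows to reach 150,000 is an assumption):

import numpy
import pandas as pd

# Three template rows (taken from the sample above) tiled to ~150,000 rows.
base = pd.DataFrame({
    'distance': [numpy.array([1, 2, 3]), numpy.array([4, 5, 6]), numpy.array([7, 8, 9])],
    'schoolChoice': [0, 2, 2],
    'students': [4, 5, 6],
})
df = pd.concat([base] * 50000, ignore_index=True)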
Timings:
   method      time
0  for loop   15.9s
1  df.apply    4.1s
2  numpy       0.7s
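(Measured with something like the sketch below; time.perf_counter is my choice here, and absolute numbers will vary by machine.)

import time

start = time.perf_counter()
totalSum = (numpy.stack(df.distance)[range(len(df.schoolChoice)), df.schoolChoice] ** 2 * df.students).sum()
print(f'numpy: {time.perf_counter() - start:.1f}s')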
Upvotes: 2