Reputation: 23
I have some code right now that works fine, but it is entirely too slow. I'm trying to add up the weighted sum of squares for every row in a Pandas DataFrame. I'd like to vectorize the operations, since that seems to run much, much faster, but there's a wrinkle in the code that has defeated my attempts to vectorize.
totalDist = 0.0
# Accumulate each row's student-weighted squared distance to its chosen school.
for index, row in pU.iterrows():
    totalDist += row['distance'][row['schoolChoice']] ** 2.0 * float(row['students'])
Each row has 'students' (an integer), 'distance' (a numpy array of length n), and 'schoolChoice' (an integer less than or equal to n-1, which designates which element of the distance array I'm using for the calculation). Basically, I'm pulling a row-specific value from the numpy array. I've used df.lookup, but that actually seems to be slower and is being deprecated. Any suggestions on how to make this run faster? Thanks in advance!
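For reference, a tiny frame with the same shape can be built like this (the values are made up for illustration, not my real data):

import numpy as np
import pandas as pd

# Illustrative frame matching the structure above: a per-row numpy array
# (possibly of different lengths), plus an index into that array.
pU = pd.DataFrame({
    'distance': [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])],
    'schoolChoice': [0, 1],
    'students': [4, 5],
})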
Upvotes: 2
Views: 69
If all else fails, you can use .apply() on each row:
totalSum = df.apply(lambda row: row.distance[row.schoolChoice] ** 2 * row.students, axis=1).sum()
To go faster, you can import numpy and index all rows at once:
totalSum = (numpy.stack(df.distance)[range(len(df.schoolChoice)), df.schoolChoice] ** 2 * df.students).sum()
The numpy method requires that distance be the same length for each row; however, it is possible to pad the arrays to a common length if needed (though this may affect any gains made). For example:
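A minimal padding sketch, assuming zero-padding is acceptable (only the schoolChoice element of each row's array is ever read, so the pad value never enters the sum):

import numpy

# Pad each ragged 'distance' array with zeros up to the longest length.
max_len = df.distance.map(len).max()
padded = numpy.zeros((len(df), max_len))
for i, arr in enumerate(df.distance):
    padded[i, :len(arr)] = arr

totalSum = (padded[numpy.arange(len(df)), df.schoolChoice] ** 2 * df.students).sum()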
Tested on a df of 150,000 rows like:
distance schoolChoice students
0 [1, 2, 3] 0 4
1 [4, 5, 6] 2 5
2 [7, 8, 9] 2 6
3 [1, 2, 3] 0 4
4 [4, 5, 6] 2 5
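A frame of that shape can be built along these lines (a sketch; the 50,000-fold repetition of the three sample rows to reach 150,000 is an assumption):

import numpy
import pandas as pd

# Three template rows (taken from the sample above) tiled to ~150,000 rows.
base = pd.DataFrame({
    'distance': [numpy.array([1, 2, 3]), numpy.array([4, 5, 6]), numpy.array([7, 8, 9])],
    'schoolChoice': [0, 2, 2],
    'students': [4, 5, 6],
})
df = pd.concat([base] * 50000, ignore_index=True)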
Timings:
   method      time
0  for loop   15.9s
1  df.apply    4.1s
2  numpy       0.7s
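(Measured with something like the sketch below; time.perf_counter is my choice here, and absolute numbers will vary by machine.)

import time

start = time.perf_counter()
totalSum = (numpy.stack(df.distance)[range(len(df.schoolChoice)), df.schoolChoice] ** 2 * df.students).sum()
print(f'numpy: {time.perf_counter() - start:.1f}s')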
Upvotes: 2