marillion

Reputation: 11170

Speed up Pandas filtering

I have a 37456153 rows x 3 columns Pandas dataframe consisting of the following columns: [Timestamp, Span, Elevation]. Each Timestamp value has approximately 62000 rows of Span and Elevation data, which looks like (when filtered for Timestamp = 17210, as an example):

        Timestamp       Span  Elevation
94614       17210  -0.019766     36.571
94615       17210  -0.019656     36.453
94616       17210  -0.019447     36.506
94617       17210  -0.018810     36.507
94618       17210  -0.017883     36.502

...           ...        ...        ...
157188      17210  91.004000     33.493
157189      17210  91.005000     33.501
157190      17210  91.010000     33.497
157191      17210  91.012000     33.500
157192      17210  91.013000     33.503

As seen above, the Span data is not equally spaced, which I actually need it to be. So I came up with the following code to convert it into an equally spaced format. I know the start and end locations I'd like to analyze. Then I defined a delta parameter as my increment. I created a numpy array called mesh, which holds the equally spaced Span data I would like to end up with. Finally, I decided to iterate over the dataframe for a given Timestamp (17300 in the code) to test how fast it would work. The for loop in the code calculates the average Elevation value within a +/- delta/2 range at each increment.

My problem is: it takes 603 ms to filter through the dataframe and calculate the average Elevation for a single iteration. For the given parameters, I have to go through 9101 iterations, resulting in approximately 1.5 hrs of computing time for this loop to finish. Moreover, this is for a single Timestamp value, and I have 600 of them (900 hrs to do them all?!).

Is there any way that I can speed up this loop? Thanks a lot for any input!

import numpy as np

# MESH GENERATION
start = 0
end = 91
delta = 0.01

# num must be an integer number of points: 0 to 91 in steps of 0.01 gives 9101 points
mesh = np.linspace(start, end, num=int(round(end / delta)) + 1)
elevation_list = []

# Loop below will take forever to run, any idea about how to optimize it?!

for current_loc in mesh:
    average_elevation = np.average(df[(df.Timestamp == 17300) &
                                      (df.Span > current_loc - delta/2) &
                                      (df.Span < current_loc + delta/2)].Elevation)
    elevation_list.append(average_elevation)

Upvotes: 3

Views: 3287

Answers (2)

Jaime

Reputation: 67427

You can vectorize the whole thing using np.searchsorted. I am not much of a pandas user, but something like this should work, and runs reasonably fast on my system. Using chrisb's dummy data:

In [8]: %%timeit
   ...: mesh = np.linspace(start, end, num=int(round(end/delta)) + 1)
   ...: midpoints = (mesh[:-1] + mesh[1:]) / 2
   ...: idx = np.searchsorted(midpoints, df.Span)
   ...: averages = np.bincount(idx, weights=df.Elevation, minlength=len(mesh))
   ...: averages /= np.bincount(idx, minlength=len(mesh))
   ...: 
100 loops, best of 3: 5.62 ms per loop  

That is about 3400x faster than your code:

In [12]: %%timeit
    ...: mesh = np.linspace(start, end, num=int(round(end/delta)) + 1)
    ...: elevation_list = []
    ...: for current_loc in mesh:
    ...:     average_elevation = np.average(df[(df.Span > current_loc - delta/2) &
    ...:                                       (df.Span < current_loc + delta/2)].Elevation)
    ...:     elevation_list.append(average_elevation)
    ...: 
1 loops, best of 3: 19.1 s per loop

EDIT: So how does this work? In midpoints we store a sorted list of the boundaries between buckets. We then do a binary search with searchsorted on this sorted list and get idx, which tells us into which bucket each data point belongs. All that is left is to aggregate the values in each bucket; that's what bincount is for. Given an array of ints, it counts how many times each number comes up. Given an array of ints and a corresponding array of weights, instead of adding 1 to the tally for a bucket, it adds the corresponding value in weights. With two calls to bincount you get the sum and the count of items per bucket: divide them and you get the bucket average.
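
To make the mechanism concrete, here is a minimal, self-contained sketch of the same searchsorted/bincount binning on toy values (the numbers below are made up for illustration and are not from the question's data):

import numpy as np

# Three buckets centred on 0.0, 1.0 and 2.0; the boundaries between them are 0.5 and 1.5.
mesh = np.array([0.0, 1.0, 2.0])
midpoints = (mesh[:-1] + mesh[1:]) / 2           # [0.5, 1.5]

span = np.array([0.1, 0.4, 1.2, 1.9, 2.3])       # toy "Span" values
elevation = np.array([10., 20., 30., 40., 50.])  # toy "Elevation" values

# Binary search: which bucket does each span value fall into? -> [0, 0, 1, 2, 2]
idx = np.searchsorted(midpoints, span)

# Sum of elevations per bucket, and number of points per bucket...
sums = np.bincount(idx, weights=elevation, minlength=len(mesh))
counts = np.bincount(idx, minlength=len(mesh))

# ...divided, give the per-bucket averages: [15., 30., 45.]
# (an empty bucket would have count 0 and produce NaN here)
averages = sums / counts
print(averages)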

Upvotes: 6

chrisb

Reputation: 52246

Here's an idea - may still be too slow but I thought I'd share. First, some dummy data.

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'Timestamp': 17210, 
                        'Span': np.linspace(-1, 92, num=60000), 
                        'Elevation': np.linspace(33., 37., num=60000)})

Then, take the mesh array you created, turn it into a dataframe, and add a shifted column, so that each row of the dataframe represents one step of the new, evenly spaced Span.

mesh_df = pd.DataFrame(mesh, columns=['Equal_Span'])
mesh_df['Equal_Span_Prev'] = mesh_df['Equal_Span'].shift(1)
mesh_df = mesh_df.dropna()

Next, I want to join this dataframe with the larger dataset, based on each entry falling between the two Equal_Span columns. There may be a way to do this in pandas, but cartesian-type joins seem much easier to express in SQL, so first I'm shipping all the data to an in-memory sqlite database. If you run into memory problems, I'd make this a file-based db.

import sqlite3
con = sqlite3.connect(':memory:')
df.to_sql('df', con, index=False)
mesh_df.to_sql('mesh_df', con, index=False)

Here's the main query. Took about 1m30s on my test data, so this will likely still take a long time on the full dataset.

join_df = pd.read_sql("""SELECT a.Timestamp, a.Span, a.Elevation, b.Equal_Span
                         FROM df a, mesh_df b
                         WHERE a.Span BETWEEN b.Equal_Span_Prev AND b.Equal_Span""", con)

But once the data is in this form, it's easy/fast to get the desired mean.

join_df.groupby(['Timestamp','Equal_Span'])['Elevation'].mean()
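
As an aside on the "there may be a way in pandas" remark above: a minimal pandas-only sketch of the same interval assignment, using pd.cut instead of the SQL join, might look like this (not part of the original answer; it assumes the df and mesh defined earlier):

import pandas as pd

# pd.cut labels each Span value with the mesh interval it falls into
# (values outside [start, end] become NaN and are dropped by the groupby).
df['Equal_Span'] = pd.cut(df['Span'], bins=mesh)
result = df.groupby(['Timestamp', 'Equal_Span'], observed=True)['Elevation'].mean()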

Upvotes: 1
