marillion

Reputation: 11170

Speed up Pandas filtering

I have a 37456153 rows x 3 columns Pandas dataframe consisting of the following columns: [Timestamp, Span, Elevation]. Each Timestamp value has approximately 62000 rows of Span and Elevation data, which looks like (when filtered for Timestamp = 17210, as an example):

        Timestamp       Span  Elevation
94614       17210  -0.019766     36.571
94615       17210  -0.019656     36.453
94616       17210  -0.019447     36.506
94617       17210  -0.018810     36.507
94618       17210  -0.017883     36.502

...           ...        ...        ...
157188      17210  91.004000     33.493
157189      17210  91.005000     33.501
157190      17210  91.010000     33.497
157191      17210  91.012000     33.500
157192      17210  91.013000     33.503

As seen above, the Span data is not equally spaced, which I actually need it to be. So I came up with the following code to convert it into an equally spaced format. I know the start and end locations I'd like to analyze. Then I defined a delta parameter as my increment. I created a numpy array called mesh, which holds the equally spaced Span data I would like to end up with. Finally, I decided to iterate over the dataframe for a given Timestamp (17300 in the code) to test how fast it would work. The for loop in the code calculates the average Elevation value within a +/- delta/2 range at each increment.

My problem is: it takes 603 ms to filter through the dataframe and calculate the average Elevation for a single iteration. For the given parameters, I have to go through 9101 iterations, resulting in approximately 1.5 hrs of computing time for this loop to finish. Moreover, this is for a single Timestamp value, and I have 600 of them (900 hrs to do them all?!).

Is there any way that I can speed up this loop? Thanks a lot for any input!

import numpy as np

# MESH GENERATION
start = 0
end = 91
delta = 0.01

# num must be an integer number of points: 0 to 91 in steps of 0.01 gives 9101 points
mesh = np.linspace(start, end, num=int(round(end / delta)) + 1)
elevation_list = []

# Loop below will take forever to run, any idea about how to optimize it?!

for current_loc in mesh:
    average_elevation = np.average(df[(df.Timestamp == 17300) &
                                      (df.Span > current_loc - delta/2) &
                                      (df.Span < current_loc + delta/2)].Elevation)
    elevation_list.append(average_elevation)

Upvotes: 3

Views: 3287

Answers (2)

Jaime

Reputation: 67427

You can vectorize the whole thing using np.searchsorted. I am not much of a pandas user, but something like this should work, and runs reasonably fast on my system. Using chrisb's dummy data:

In [8]: %%timeit
   ...: mesh = np.linspace(start, end, num=int(round(end/delta)) + 1)
   ...: midpoints = (mesh[:-1] + mesh[1:]) / 2
   ...: idx = np.searchsorted(midpoints, df.Span)
   ...: averages = np.bincount(idx, weights=df.Elevation, minlength=len(mesh))
   ...: averages /= np.bincount(idx, minlength=len(mesh))
   ...: 
100 loops, best of 3: 5.62 ms per loop  

That is about 3400x faster than your code:

In [12]: %%timeit
    ...: mesh = np.linspace(start, end, num=int(round(end/delta)) + 1)
    ...: elevation_list = []
    ...: for current_loc in mesh:
    ...:     average_elevation = np.average(df[(df.Span > current_loc - delta/2) &
    ...:                                       (df.Span < current_loc + delta/2)].Elevation)
    ...:     elevation_list.append(average_elevation)
    ...: 
1 loops, best of 3: 19.1 s per loop

EDIT: So how does this work? In midpoints we store a sorted list of the boundaries between buckets. We then do a binary search with searchsorted on this sorted list and get idx, which tells us into which bucket each data point belongs. All that is left is to aggregate the values in each bucket; that's what bincount is for. Given an array of ints, it counts how many times each number comes up. Given an array of ints and a corresponding array of weights, instead of adding 1 to the tally for a bucket, it adds the corresponding value in weights. With two calls to bincount you get the sum and the count of items per bucket: divide them and you get the bucket average.
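
To make the mechanism concrete, here is a minimal, self-contained sketch of the same searchsorted/bincount binning on toy values (the numbers below are made up for illustration and are not from the question's data):

import numpy as np

# Three buckets centred on 0.0, 1.0 and 2.0; the boundaries between them are 0.5 and 1.5.
mesh = np.array([0.0, 1.0, 2.0])
midpoints = (mesh[:-1] + mesh[1:]) / 2           # [0.5, 1.5]

span = np.array([0.1, 0.4, 1.2, 1.9, 2.3])       # toy "Span" values
elevation = np.array([10., 20., 30., 40., 50.])  # toy "Elevation" values

# Binary search: which bucket does each span value fall into? -> [0, 0, 1, 2, 2]
idx = np.searchsorted(midpoints, span)

# Sum of elevations per bucket, and number of points per bucket...
sums = np.bincount(idx, weights=elevation, minlength=len(mesh))
counts = np.bincount(idx, minlength=len(mesh))

# ...divided, give the per-bucket averages: [15., 30., 45.]
# (an empty bucket would have count 0 and produce NaN here)
averages = sums / counts
print(averages)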

Upvotes: 6

chrisb

Reputation: 52246

Here's an idea - may still be too slow but I thought I'd share. First, some dummy data.

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'Timestamp': 17210, 
                        'Span': np.linspace(-1, 92, num=60000), 
                        'Elevation': np.linspace(33., 37., num=60000)})

Then, take the mesh array you created, turn it into a dataframe, and add a shifted column, so that each row of the dataframe represents one step of the new, evenly spaced Span.

mesh_df = pd.DataFrame(mesh, columns=['Equal_Span'])
mesh_df['Equal_Span_Prev'] = mesh_df['Equal_Span'].shift(1)
mesh_df = mesh_df.dropna()

Next, I want to join this dataframe with the larger dataset, based on each entry falling between the two Equal_Span columns. There may be a way to do this in pandas, but cartesian-type joins seem much easier to express in SQL, so first I'm shipping all the data to an in-memory sqlite database. If you run into memory problems, I'd make this a file-based db.

import sqlite3
con = sqlite3.connect(':memory:')
df.to_sql('df', con, index=False)
mesh_df.to_sql('mesh_df', con, index=False)

Here's the main query. Took about 1m30s on my test data, so this will likely still take a long time on the full dataset.

join_df = pd.read_sql("""SELECT a.Timestamp, a.Span, a.Elevation, b.Equal_Span
                         FROM df a, mesh_df b
                         WHERE a.Span BETWEEN b.Equal_Span_Prev AND b.Equal_Span""", con)

But once the data is in this form, it's easy/fast to get the desired mean.

join_df.groupby(['Timestamp','Equal_Span'])['Elevation'].mean()
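
As an aside on the "there may be a way in pandas" remark above: a minimal pandas-only sketch of the same interval assignment, using pd.cut instead of the SQL join, might look like this (not part of the original answer; it assumes the df and mesh defined earlier):

import pandas as pd

# pd.cut labels each Span value with the mesh interval it falls into
# (values outside [start, end] become NaN and are dropped by the groupby).
df['Equal_Span'] = pd.cut(df['Span'], bins=mesh)
result = df.groupby(['Timestamp', 'Equal_Span'], observed=True)['Elevation'].mean()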

Upvotes: 1
