Reputation: 11170
I have a 37456153 rows x 3 columns Pandas dataframe consisting of the following columns: [Timestamp, Span, Elevation]. Each Timestamp value has approximately 62000 rows of Span and Elevation data, which looks like this (filtered for Timestamp = 17210, as an example):
Timestamp Span Elevation
94614 17210 -0.019766 36.571
94615 17210 -0.019656 36.453
94616 17210 -0.019447 36.506
94617 17210 -0.018810 36.507
94618 17210 -0.017883 36.502
... ... ... ...
157188 17210 91.004000 33.493
157189 17210 91.005000 33.501
157190 17210 91.010000 33.497
157191 17210 91.012000 33.500
157192 17210 91.013000 33.503
As seen above, the Span data is not equally spaced, but I need it to be. So I came up with the following code to convert it into an equally spaced format. I know the start and end locations I'd like to analyze. Then I defined a delta parameter as my increment. I created a numpy array called mesh, which holds the equally spaced Span data I would like to end up with. Finally, I decided to iterate over the dataframe for a given Timestamp (17300 in the code) to test how fast it would work. The for loop in the code calculates the average Elevation within a +/- delta/2 range at each increment.
My problem is: it takes 603 ms to filter the dataframe and calculate the average Elevation for a single iteration. For the given parameters, I have to go through 9101 iterations, which comes to approximately 1.5 hrs of computing time for this loop. Moreover, this is for a single Timestamp value, and I have 600 of them (900 hrs to do them all?!).
Is there any way I can speed up this loop? Thanks a lot for any input!
# MESH GENERATION
start = 0
end = 91
delta = 0.01
mesh = np.linspace(start, end, num=int(round(end / delta)) + 1)

elevation_list = []
# Loop below will take forever to run, any idea about how to optimize it?!
for current_loc in mesh:
    average_elevation = np.average(df[(df.Timestamp == 17300) &
                                      (df.Span > current_loc - delta / 2) &
                                      (df.Span < current_loc + delta / 2)].Elevation)
    elevation_list.append(average_elevation)
Upvotes: 3
Views: 3287
Reputation: 67427
You can vectorize the whole thing using np.searchsorted. I am not much of a pandas user, but something like this should work, and it runs reasonably fast on my system. Using chrisb's dummy data:
In [8]: %%timeit
...: mesh = np.linspace(start, end, num=int(round(end / delta)) + 1)
...: midpoints = (mesh[:-1] + mesh[1:]) / 2
...: idx = np.searchsorted(midpoints, df.Span)
...: averages = np.bincount(idx, weights=df.Elevation, minlength=len(mesh))
...: averages /= np.bincount(idx, minlength=len(mesh))
...:
100 loops, best of 3: 5.62 ms per loop
That is about 3400x faster than your code:
In [12]: %%timeit
...: mesh = np.linspace(start, end, num=int(round(end / delta)) + 1)
...: elevation_list = []
...: for current_loc in mesh:
...:     average_elevation = np.average(df[(df.Span > current_loc - delta/2) &
...:                                       (df.Span < current_loc + delta/2)].Elevation)
...:     elevation_list.append(average_elevation)
...:
1 loops, best of 3: 19.1 s per loop
EDIT: So how does this work? In midpoints we store a sorted list of the boundaries between buckets. We then do a binary search with searchsorted on this sorted list and get idx, which tells us into which bucket each data point belongs. All that is left is to aggregate the values in each bucket. That's what bincount is for. Given an array of ints, it counts how many times each number comes up. Given an array of ints and a corresponding array of weights, instead of adding 1 to the tally for the bucket, it adds the corresponding value in weights. With two calls to bincount you get the sum and the count of items per bucket: divide them and you get the bucket average.
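To make that concrete, here is a tiny standalone example of the same searchsorted/bincount pattern (the numbers are made up for illustration, not taken from the question's data):
import numpy as np

# Bucket boundaries (the "midpoints"). With searchsorted's default
# side='left', values <= 1.5 land in bucket 0, values in (1.5, 2.5]
# land in bucket 1, and values > 2.5 land in bucket 2.
midpoints = np.array([1.5, 2.5])

x = np.array([1.1, 1.4, 2.0, 2.9, 3.1])  # stand-in for df.Span
w = np.array([10., 20., 30., 40., 50.])  # stand-in for df.Elevation

idx = np.searchsorted(midpoints, x)              # array([0, 0, 1, 2, 2])
sums = np.bincount(idx, weights=w, minlength=3)  # array([30., 30., 90.])
counts = np.bincount(idx, minlength=3)           # array([2, 1, 2])
print(sums / counts)                             # [15. 30. 45.]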
Upvotes: 6
Reputation: 52246
Here's an idea - it may still be too slow, but I thought I'd share. First, some dummy data.
df = pd.DataFrame(data={'Timestamp': 17210,
                        'Span': np.linspace(-1, 92, num=60000),
                        'Elevation': np.linspace(33., 37., num=60000)})
Then, take the mesh array you created, turn it into a dataframe, and add a shifted column, so each row in the dataframe represents one step of the new, evenly spaced Span.
mesh_df = pd.DataFrame(mesh, columns=['Equal_Span'])
mesh_df['Equal_Span_Prev'] = mesh_df['Equal_Span'].shift(1)
mesh_df = mesh_df.dropna()
Next, I want to join this dataframe with the larger dataset, based on each entry falling between the two Equal_Span columns. There may be a way to do this in pandas, but cartesian-type joins seem much easier to express in SQL, so first I'm shipping all the data to an in-memory sqlite database. If you run into memory problems, I'd make this a file-based db.
import sqlite3
con = sqlite3.connect(':memory:')
df.to_sql('df', con, index=False)
mesh_df.to_sql('mesh_df', con, index=False)
Here's the main query. It took about 1m30s on my test data, so it will likely still take a long time on the full dataset.
join_df = pd.read_sql("""SELECT a.Timestamp, a.Span, a.Elevation, b.Equal_Span
FROM df a, mesh_df b
WHERE a.Span BETWEEN b.Equal_Span_Prev AND b.Equal_Span""", con)
But once the data is in this form, it's easy/fast to get the desired mean.
join_df.groupby(['Timestamp','Equal_Span'])['Elevation'].mean()
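For reference, the same interval assignment can also be sketched in pure pandas with pd.cut, avoiding the round trip through sqlite (a rough equivalent, assuming df and mesh are defined as above; the variable names below are just illustrative):
# pd.cut labels each Span value with the mesh interval it falls into;
# groupby then averages Elevation per (Timestamp, interval).
# Note: pd.cut uses half-open (a, b] intervals, so boundary handling
# differs slightly from the inclusive SQL BETWEEN above.
buckets = pd.cut(df['Span'], bins=mesh)
equal_span_means = df.groupby(['Timestamp', buckets])['Elevation'].mean()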
Upvotes: 1