jsstuball

Reputation: 4911

Pandas dataframe rolling mean efficiently

I have a huge dataframe to which I frequently add a single row. I wish to update the rolling mean and standard deviation (over a recent time window) after each addition. Does anybody know whether pandas automatically does this in constant time using the deltas of the newly included/expired rows, or whether it recalculates the sums and sums-of-squares from scratch every time?

Edit: An example has been requested to explain what I mean by constant time rolling mean update:

data_set = (1,2,3)
old_mean = 2

new_value = 4
expired_value = 1
new_mean = (old_mean * num_values - expired_value + new_value) / num_values
         = (2 * 3 - 1 + 4) / 3
         = 3

^Constant time. It matters for large dataframes.
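
In plain Python, the kind of update I have in mind would look roughly like this (just a sketch to illustrate the idea; the names are my own, not anything pandas exposes):

from collections import deque

# Sliding-window mean updated in O(1) per step by adding the new value
# and subtracting the expired one (illustration only, not a pandas API).
window = deque([1, 2, 3])
n = len(window)                      # window size stays fixed
mean = sum(window) / n               # old_mean = 2

new_value = 4
expired_value = window.popleft()     # value that falls out of the window (= 1)
window.append(new_value)
mean = (mean * n - expired_value + new_value) / n   # -> 3.0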

Upvotes: 4

Views: 1499

Answers (1)

Viktor

Reputation: 416

From the performance numbers and the code it looks like* pandas does not do the optimization you are asking about (*I don't have in-depth knowledge of the codebase, so I may be missing something). If you look at how the Window operations are implemented, you won't find anything that reuses precomputed values. And if you think about it, pandas would have to remember every 'dirty' row whose value has changed, which would be quite memory intensive. The performance numbers bear this out:

import pandas
import numpy
df = pandas.DataFrame({'A':numpy.random.random(20000000)})
r = df.rolling(3)


%%time
r.mean()

CPU times: user 312 ms, sys: 515 ms, total: 828 ms
Wall time: 836 ms

# Add one row
df.loc[len(df)] = {'A': numpy.random.random(1)[0]}


%%time
r.mean()

CPU times: user 334 ms, sys: 534 ms, total: 868 ms
Wall time: 950 ms

The times barely change no matter how often you execute the cells. So how can you improve them? Let's use this as the baseline to compare the other results against:

%timeit df.rolling(3).mean()

854 ms ± 38.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This computes the mean for every row with a window size of 3. If all you need is the mean of the last few rows relevant to your new row, you can do that with pandas alone:

%timeit df.iloc[-3:, df.columns.get_loc('A')].mean()

192 µs ± 21.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That is already an improvement of over 4000x. Using numpy directly speeds it up further:

%timeit df.values[-3:].mean()

15.2 µs ± 699 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That is more than 50,000x faster than the baseline. Without a lot more effort, that's as fast as it gets. But be aware that by implementing operations at a lower level yourself you lose some of the convenience that pandas provides (type checking, conversions, etc.). That becomes even more true if you want to use Cython.
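
For completeness, if the constant-time update from the question is what you really need, one option is to keep the running sum and sum of squares yourself. This is only a sketch under that assumption; RollingStats and its methods are made-up names for illustration, not anything pandas provides:

import math
from collections import deque

class RollingStats:
    """Fixed-size window whose mean and sample std are updated in O(1)
    per appended value via a running sum and sum of squares.
    Illustrative sketch only; not a pandas feature."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0
        self.total_sq = 0.0

    def push(self, x):
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.window) > self.size:        # expire the oldest value
            old = self.window.popleft()
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        return self.total / len(self.window)

    def std(self):
        # Sample standard deviation (ddof=1), like pandas' rolling std.
        n = len(self.window)
        if n < 2:
            return float('nan')
        var = (self.total_sq - self.total ** 2 / n) / (n - 1)
        return math.sqrt(max(var, 0.0))   # clamp tiny negative rounding errors

rs = RollingStats(3)
for x in (1, 2, 3, 4):
    rs.push(x)
print(rs.mean(), rs.std())   # 3.0 1.0

Subtracting running sums like this can drift numerically over very long streams; Welford's algorithm is the usual remedy if that matters for your data.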

Upvotes: 3
