Reputation: 173
I'm trying to calculate a rolling average on some incomplete data. I want to average values in column 2 across windows of size 1.0 of the value in column 1 (miles). I've tried .rolling(), but (from my limited understanding) this only creates windows based on the index, and not on column values.
import pandas as pd
import numpy as np
df = pd.DataFrame([
[4.5, 10],
[4.6, 11],
[4.8, 9],
[5.5, 6],
[5.6, 6],
[8.1, 10],
[8.2, 13]
])
averages = []
for index in range(len(df)):
nearby = df.loc[np.abs(df[0] - df.loc[index][0]) <= 0.5]
averages.append(nearby[1].mean())
df['rollingAve'] = averages
Gives the desired output:
0 1 rollingAve
0 4.5 10 10.0
1 4.6 11 10.0
2 4.8 9 10.0
3 5.5 6 6.0
4 5.6 6 6.0
5 8.1 10 11.5
6 8.2 13 11.5
But this slows down substantially for big dataframes. Is there a way to implement .rolling() with varying window sizes, or something similar?
Upvotes: 3
Views: 1712
Reputation: 26201
Panda's BaseIndexer is quite handy, although it takes a little bit of head-scratching to get it right.
In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:
from pandas.api.indexers import BaseIndexer
class RangeWindow(BaseIndexer):
def __init__(self, val, width):
self.val = val.values
self.width = width
def get_window_bounds(self, num_values, min_periods, center, closed):
if min_periods is None: min_periods = 0
if closed is None: closed = 'left'
w = (-self.width/2, self.width/2) if center else (0, self.width)
side0 = 'left' if closed in ['left', 'both'] else 'right'
side1 = 'right' if closed in ['right', 'both'] else 'left'
ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
ix1 = np.maximum(ix1, ix0 + min_periods)
return ix0, ix1
Some deluxe options: min_periods
, center
, and closed
are implemented according to what the DataFrame.rolling specifies.
Application:
df = pd.DataFrame([
[4.5, 10],
[4.6, 11],
[4.8, 9],
[5.5, 6],
[5.6, 6],
[8.1, 10],
[8.2, 13]
], columns='a b'.split())
df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()
# gives:
0 10.0
1 10.0
2 10.0
3 6.0
4 6.0
5 11.5
6 11.5
Name: b, dtype: float64
Timing:
df = pd.DataFrame(
np.random.uniform(0, 1000, size=(1_000_000, 2)),
columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)
%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()
CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
Update on performance:
Following a comment from @anon01, I was wondering if one could go faster for the case when the rolling involves large windows. Turns out I should have measured Pandas's rolling mean and sum performance first... (Premature optimization, anyone?) See at the end why.
Anyway, the idea was to do a cumsum
just once, then take the difference of elements dereferenced by the windows endpoints:
# both below working on numpy arrays:
def fast_rolling_sum(a, b, width):
z = np.concatenate(([0], np.cumsum(b)))
ix0 = np.searchsorted(a, a - width/2, side='left')
ix1 = np.searchsorted(a, a + width/2, side='right')
return z[ix1] - z[ix0]
def fast_rolling_mean(a, b, width):
z = np.concatenate(([0], np.cumsum(b)))
ix0 = np.searchsorted(a, a - width/2, side='left')
ix1 = np.searchsorted(a, a + width/2, side='right')
return (z[ix1] - z[ix0]) / (ix1 - ix0)
With this (and the 1-million rows df
above), I see:
%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
versus:
%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
However!!! Pandas is likely already doing such an optimization (it's a pretty obvious one). The timings don't increase with larger windows (which is why I was saying I should have checked first).
Upvotes: 2
Reputation: 11161
df.rolling
and series.rolling
do allow for value-based windows if the index is of type DateTimeIndex
or TimedeltaIndex
. You can use this to get close to the desired result:
df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)
output:
0 1 rolling_mean
0 4.5 10 10.000000
1 4.6 11 10.500000
2 4.8 9 10.000000
3 5.5 6 8.666667
4 5.6 6 7.000000
5 8.1 10 10.000000
6 8.2 13 11.500000
Advantages This is a three-line solution that should have great performance, leveraging pandas datetime backend.
Disadvantages
This is definitely a hack, casting your miles column to time-delta seconds, and the average isn't centered (center
isn't implemented for datetimelike and offset based windows).
Overall: if you value performance and can live with a non-centered mean, this would be a great way to go with a comment or two.
Upvotes: 0