Reputation: 11
I have a dataframe where, for multiple products over multiple dates, I have some numerical quantities I want to calculate rolling metrics on:
df = pd.DataFrame(data=data, columns=['DATE', 'PRODUCTS', 'NUMQUANTITY_1', 'NUMQUANTITY_2', ..., 'NUMQUANTITY_20'])
I am doing the following computations:
rolling_calculations = df.groupby('PRODUCTS')\
.rolling('7D', on='DATE', closed='left')\
.apply(np.nanmean, raw=True)
which works, but is quite slow: for about 2 million rows and 20 NUMQUANTITY columns, it takes around 30 minutes on my laptop to compute a rolling np.nanmean.
Is there any way to make this go faster?
Thanks for your help
Upvotes: 1
Views: 1077
Reputation: 68146
I think using the mean function built into pandas will give you a significant speed boost, though the full operation may still take some time.
Start with:
import numpy
import pandas
N = 1000
x = numpy.random.uniform(size=(N, 5))
dates = pandas.date_range(start='1910-01-01', freq='5T', periods=N, name='DATE')
nums = pandas.DataFrame(data=x, columns=list('ABCDE'), index=dates)
cats = pandas.DataFrame({'CATEGORY': numpy.random.choice(list('abcdefghi'), size=N)}, index=dates)
df = nums.join(cats).reset_index()
df.loc[df['A'] < 0.5, list('ABCDE')] = numpy.nan
roll = df.groupby('CATEGORY').rolling('7D', on='DATE', closed='left')
The performance is much better with:
%%timeit
roll.mean()
# 88.1 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
than with:
%%timeit
roll.apply(numpy.nanmean, raw=True)
# 658 ms ± 96.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And just to check the work, the following doesn't raise an AssertionError:
x = roll.mean()
y = roll.apply(numpy.nanmean, raw=True)
pandas.testing.assert_frame_equal(x, y)
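As a side note, if you need several statistics per window, you can request them in one pass with .agg and a list of built-in names; each one stays on the fast compiled path and skips NaNs like its numpy.nan* counterpart. A minimal sketch (the column name and data here are invented for the example, not your real frame):

```python
import numpy
import pandas

# Invented toy data: one numeric column with some NaNs on a datetime index.
N = 300
rng = numpy.random.default_rng(0)
dates = pandas.date_range('2020-01-01', freq='1H', periods=N, name='DATE')
df = pandas.DataFrame({'A': rng.uniform(size=N)}, index=dates)
df.loc[df['A'] < 0.3, 'A'] = numpy.nan

roll = df.rolling('7D', closed='left')

# One pass, three built-in aggregations; result has MultiIndex columns
# like ('A', 'mean'), ('A', 'min'), ('A', 'max').
stats = roll.agg(['mean', 'min', 'max'])

# The 'mean' column matches roll.mean() exactly (NaN where the window
# is empty, since closed='left' excludes the current row).
assert numpy.allclose(
    stats[('A', 'mean')].to_numpy(),
    roll.mean()['A'].to_numpy(),
    equal_nan=True,
)
```

The same idea applies per group: df.groupby('CATEGORY').rolling(...).agg([...]) keeps every listed aggregation on the built-in path instead of round-tripping through apply once per statistic.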
Upvotes: 1