Reputation: 831
As part of some simulations I'm running, I need to eventually perform the following operation on some very long sequences of (real) numbers. Here's the gist:
Given a long 1-d NumPy array, for each position in the array I want to average the values before and after that position, take the difference between the averages, and load those differences into another NumPy array of the same dimension as the original.
Here's my attempt. It works perfectly, except that it gets super slow as the sequence gets longer.
import numpy as np

def test_sequence(npseq):
    n = npseq.shape[0]
    def f(i):
        pre = np.sum(npseq[:i])/i
        post = np.sum(npseq[(i+1):])/(n-i)
        return pre-post
    out = np.array([f(i) for i in range(1,n)])
    return out
Seems straightforward enough. But...
In [26]: a = np.random.randint(0,100,100000)
In [27]: %timeit example.test_sequence(a)
1 loops, best of 3: 7.69 s per loop
In [17]: a = np.random.randint(0,100,400000)
In [18]: %timeit example.test_sequence(a)
1 loops, best of 3: 1min 50s per loop
I know that there's probably a smart way to vectorize this, but I'm inexperienced with NumPy. Can anyone point me in the right direction?
EDIT: I originally wrote "sum" instead of "average." I meant "average." My bad. And I know there might be an off-by-one error in there; I'm not concerned about it for now. The actual problem is slightly more complicated than the version I've presented here, so I'd need to fiddle with it anyway.
Upvotes: 4
Views: 308
Reputation: 24945
As an alternative, you can also use the (soon to be deprecated) expanding_mean function from pandas:
import numpy as np
import pandas as pd

a = np.array([32, 69, 79, 34, 1, 77, 54, 42, 73, 75])
pd.expanding_mean(a)[:-2:] - pd.expanding_mean(a[::-1])[-3::-1]
The output here matches (I think) a fixed version of your function; given the off-by-one issues you mention, you can adjust it however you like:
def test_sequence(npseq):
    n = npseq.shape[0]
    def f(i):
        pre = np.sum(npseq[:i])/i
        post = np.sum(npseq[(i+1):])/(n-i-1)
        return pre-post
    out = np.array([f(i) for i in range(1,n-1)])
    return out
test_sequence(a)
array([-22.375 , -0.35714286, 6.33333333, -10.7 ,
-18. , -14.66666667, -24.57142857, -26.5 ])
In newer pandas versions, the replacement uses .expanding() followed by .mean():
pd.Series(a[:-2:]).expanding().mean() - pd.Series(a[::-1]).expanding().mean()[-3::-1].reset_index(drop = True)
0 -22.375000
1 -0.357143
2 6.333333
3 -10.700000
4 -18.000000
5 -14.666667
6 -24.571429
7 -26.500000
dtype: float64
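The reset_index(drop = True) at the end is what makes the subtraction positional: pandas aligns arithmetic on index labels, and the reversed slice carries a reversed index. A minimal illustration of the pitfall:

import pandas as pd

s = pd.Series([1., 2., 3.])
r = s[::-1]                      # same values, but the index now runs 2, 1, 0
s - r                            # aligns on labels, so every element is 0.0
s - r.reset_index(drop=True)     # positional subtraction: -2.0, 0.0, 2.0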
and some timing:
a = np.random.randint(0,100,100000)
%timeit test_sequence(a)
%timeit pd.Series(a[:-2:]).expanding().mean() - pd.Series(a[::-1]).expanding().mean()[-3::-1].reset_index(drop = True)
1 loop, best of 3: 8.17 s per loop
10 loops, best of 3: 18.5 ms per loop
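As a quick sanity check (a sketch assuming the fixed test_sequence above and Python 3 division), the loop and the pandas one-liner agree to floating-point precision:

import numpy as np
import pandas as pd

a = np.random.randint(0, 100, 1000)

slow = test_sequence(a)  # the fixed loop version above
fast = (pd.Series(a[:-2:]).expanding().mean()
        - pd.Series(a[::-1]).expanding().mean()[-3::-1].reset_index(drop=True))

assert np.allclose(slow, fast.values)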
Upvotes: 0
Reputation: 107297
Here is one way, using np.cumsum():
np.cumsum(a[::-1])[::-1] - np.cumsum(a)
np.cumsum(a) gives, at each position, the sum of the items up to and including it, and np.cumsum(a[::-1])[::-1] gives the sum of the items from that position onward (so the expression above is the trailing sum minus the leading sum; negate it if you want pre - post as in the question). To turn the sums into averages, divide by the element counts: np.arange(a.size + 1, 1, -1) for the trailing sums and np.arange(1, a.size + 1) for the leading ones (the offsets here mirror the question's off-by-ones, which the OP says not to worry about), so you can just do:
np.cumsum(a[::-1])[::-1]/np.arange(a.size + 1, 1, -1) - np.cumsum(a)/np.arange(1, a.size + 1)
Demo:
In [53]: a
Out[53]: array([32, 69, 79, 34, 1, 77, 54, 42, 73, 75])
In [54]: np.cumsum(a[::-1])[::-1]/np.arange(a.size + 1, 1, -1) - np.cumsum(a)/np.arange(1, a.size + 1)
Out[54]:
array([ 16.72727273, -0.1 , -11.66666667, -9. ,
3. , 4.83333333, -0.62857143, -1. ,
-1.88888889, -16.1 ])
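If you want the exact sign and denominators of the fixed test_sequence from the other answer (pre minus post), the same cumsum trick can be wrapped up as below; a minimal sketch, with test_sequence_vec being a name of my own choosing:

import numpy as np

def test_sequence_vec(a):
    # computes mean(a[:i]) - mean(a[i+1:]) for i = 1 .. n-2, all at once
    a = np.asarray(a, dtype=np.float64)     # promote so integer input divides correctly
    n = a.size
    pre = np.cumsum(a)[:-2]                 # sum(a[:i])   for i = 1 .. n-2
    post = np.cumsum(a[::-1])[::-1][2:]     # sum(a[i+1:]) for i = 1 .. n-2
    i = np.arange(1, n - 1)
    return pre / i - post / (n - i - 1)

This is a few O(n) passes instead of an O(n^2) loop, and np.allclose(test_sequence(a), test_sequence_vec(a)) should hold against the fixed loop version.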
Upvotes: 1