gogurt

Reputation: 831

Averages of slices on a 1d nparray: how to make it more NumPy-thonic?

As part of some simulations I'm running, I need to eventually perform the following operation on some very long sequences of (real) numbers. Here's the gist:

Given a long 1-d NumPy array, for each position in the array I want to average the values before and after that position, take the difference between the two averages, and load those differences into another nparray of the same dimension as the original array.

Here's my attempt. It works perfectly, except that it gets super slow as the sequence gets longer.

import numpy as np

def test_sequence(npseq):
    n = npseq.shape[0]

    # difference between the mean of everything before position i
    # and the mean of everything after it
    def f(i):
        pre = np.sum(npseq[:i]) / i
        post = np.sum(npseq[(i+1):]) / (n - i)
        return pre - post

    out = np.array([f(i) for i in range(1, n)])

    return out

Seems straightforward enough. But...

In [26]: a = np.random.randint(0,100,100000)
In [27]: %timeit example.test_sequence(a)
1 loops, best of 3: 7.69 s per loop

In [17]: a = np.random.randint(0,100,400000)
In [18]: %timeit example.test_sequence(a)
1 loops, best of 3: 1min 50s per loop

I know that there's probably a smart way to vectorize this, but I'm inexperienced with NumPy. Can anyone point me in the right direction?

EDIT: I originally wrote "sum" instead of "average." I meant "average." My bad. And I know that there might be an off-by-one error in there; I'm not concerned about it for now. The actual problem is slightly more complicated than the version I've presented here, so I'd need to fiddle with it anyway.

Upvotes: 4

Views: 308

Answers (2)

jeremycg

Reputation: 24945

As an alternative, you can also use the (soon to be deprecated) expanding_mean function from pandas:

import numpy as np
import pandas as pd

a = np.array([32, 69, 79, 34,  1, 77, 54, 42, 73, 75])
pd.expanding_mean(a)[:-2:] - pd.expanding_mean(a[::-1])[-3::-1]

The output here matches (I think) a fixed version of your function; as for the off-by-one errors, you can fix them however you'd like:

def test_sequence(npseq):
    n = npseq.shape[0]

    def f(i):
        pre = np.sum(npseq[:i]) / i
        # denominator now matches the length of npseq[(i+1):]
        post = np.sum(npseq[(i+1):]) / (n - i - 1)
        return pre - post

    out = np.array([f(i) for i in range(1, n - 1)])

    return out

test_sequence(a)

array([-22.375     ,  -0.35714286,   6.33333333, -10.7       ,
       -18.        , -14.66666667, -24.57142857, -26.5       ])

The new replacement version should use expanding and mean:

pd.Series(a[:-2:]).expanding().mean() - pd.Series(a[::-1]).expanding().mean()[-3::-1].reset_index(drop = True)

0   -22.375000
1    -0.357143
2     6.333333
3   -10.700000
4   -18.000000
5   -14.666667
6   -24.571429
7   -26.500000
dtype: float64

and some timing:

a = np.random.randint(0,100,100000)
%timeit test_sequence(a)
%timeit pd.Series(a[:-2:]).expanding().mean() - pd.Series(a[::-1]).expanding().mean()[-3::-1].reset_index(drop = True)

1 loop, best of 3: 8.17 s per loop
10 loops, best of 3: 18.5 ms per loop
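
If you want this packaged as a drop-in function that hands back a plain NumPy array, here is a minimal sketch (the name expanding_diff is mine, and it keeps the same off-by-one convention as the one-liner above rather than the exact definition in the question):

import numpy as np
import pandas as pd

def expanding_diff(a):
    # running mean of the prefixes a[:1], a[:2], ..., dropping the last two
    # entries so it lines up with the suffix part below
    pre = pd.Series(a[:-2]).expanding().mean()
    # running mean of the suffixes, computed on the reversed array and then
    # re-reversed; reset_index so the two Series subtract positionally
    post = pd.Series(a[::-1]).expanding().mean().iloc[-3::-1].reset_index(drop=True)
    return (pre - post).to_numpy()

The .to_numpy() at the end just converts the resulting Series back to an ndarray so the caller never has to touch pandas objects.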

Upvotes: 0

Kasravnd

Reputation: 107297

Here is one way using np.cumsum():

np.cumsum(a[::-1])[::-1] - np.cumsum(a)

np.cumsum(a) will generate the running sum of the previous items, and np.cumsum(a[::-1])[::-1] gives the sum of the next items. So if you want to calculate the averages, the counts for the next items will be np.arange(a.size + 1, 1, -1) and the counts for the previous items will be np.arange(1, a.size + 1), so you can just do:

np.cumsum(a[::-1])[::-1]/np.arange(a.size + 1, 1, -1) - np.cumsum(a)/np.arange(1, a.size + 1)

Demo:

In [53]: a
Out[53]: array([32, 69, 79, 34,  1, 77, 54, 42, 73, 75])

In [54]: np.cumsum(a[::-1])[::-1]/np.arange(a.size + 1 , 1, -1) - np.cumsum(a)/np.arange(1, a.size + 1)
Out[54]: 
array([ 16.72727273,  -0.1       , -11.66666667,  -9.        ,
         3.        ,   4.83333333,  -0.62857143,  -1.        ,
        -1.88888889, -16.1       ])
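
If you want to sanity-check the vectorized expression against a plain Python loop, here is a minimal sketch (the loop uses the same denominators as the one-liner above, i.e. the same off-by-one convention, not necessarily the exact one from the question):

import numpy as np

def cumsum_diff(a):
    # vectorized: suffix means minus prefix means, with the same
    # denominators as the expression above
    return (np.cumsum(a[::-1])[::-1] / np.arange(a.size + 1, 1, -1)
            - np.cumsum(a) / np.arange(1, a.size + 1))

def loop_reference(a):
    # straightforward per-position loop with matching denominators
    n = a.size
    out = np.empty(n)
    for i in range(n):
        out[i] = a[i:].sum() / (n + 1 - i) - a[:i + 1].sum() / (i + 1)
    return out

a = np.random.randint(0, 100, 1000)
print(np.allclose(cumsum_diff(a), loop_reference(a)))   # expect True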

Upvotes: 1
