slaw

Reputation: 6909

Efficient NumPy way of rolling nanstd

I have a 1-D NumPy array over which I create a rolling window and then compute np.nanstd:

import numpy as np

def rolling_window(a, window):
    # Strided view with shape (len(a) - window + 1, window): each row is one
    # window of `a`, no data is copied.
    a = np.asarray(a)
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)

    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

if __name__ == "__main__":
    n = 100_000_000
    nan_indices = np.random.choice(np.arange(n), size=1000, replace=False)
    T = np.random.rand(n)
    T[nan_indices] = np.nan
    m = 50
    np.nanstd(rolling_window(T, m), axis=T.ndim)

However, I noticed that not only is this extremely time consuming, it also uses a lot of memory. Is there a way to improve both the memory and speed performance (Numba is an option)?

Upvotes: 1

Views: 470

Answers (1)

Divakar

Reputation: 221754

NumPy vectorized

After working through the math, here's what I ended up with: a few np.convolve calls and some masking give us a vectorized NumPy solution -

def nanstd(a, W):
    k = np.ones(W, dtype=int)
    m = ~np.isnan(a)                  # mask of valid (non-NaN) entries
    a0 = np.where(m, a, 0)            # NaNs replaced with 0

    n = np.convolve(m, k, 'valid')    # count of valid entries per window
    c1 = np.convolve(a0, k, 'valid')  # sum of valid entries per window
    f2 = c1**2
    p2 = f2/n**2
    f1 = np.convolve((a0**2)*m, k, 'valid') + n*p2  # sum of squares + n*mean**2
    out = np.sqrt((f1 - (2/n)*f2)/n)  # per-window variance -> std
    return out
Complete explanation is at the end of this post.
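
As a quick sanity check (assuming rolling_window from the question is in scope), the convolution version should match the question's np.nanstd-over-strided-windows approach on a small input -

# Sanity check of nanstd() against the question's approach, on a small array
a = np.random.rand(1000)
a[[10, 500]] = np.nan
W = 50
ref = np.nanstd(rolling_window(a, W), axis=-1)   # reference, memory-heavy approach
out = nanstd(a, W)                               # convolution-based version
print(np.allclose(ref, out))                     # expected: True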

Pandas equivalent

Here's the equivalent pandas version, which isn't too bad on performance -

import pandas as pd

def pdroll(T,m):
    return pd.Series(T).rolling(m).std(ddof=0).values[m-1:]
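
One caveat worth spot-checking (a small sketch, reusing nanstd from above): with pandas' default min_periods, any window that contains a NaN comes out as NaN, whereas nanstd ignores the NaNs, so the two agree on the fully valid windows -

# Compare pandas rolling std (ddof=0) with the convolution-based nanstd,
# restricting to windows that contain no NaN
T = np.random.rand(1000)
T[[3, 400]] = np.nan
m = 20
pd_out = pdroll(T, m)
conv_out = nanstd(T, m)
valid = ~np.isnan(pd_out)                             # pandas yields NaN where a window has a NaN
print(np.allclose(pd_out[valid], conv_out[valid]))    # expected: True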

Benchmarking

Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.

def setup(n):
    nan_indices = np.random.choice(np.arange(n), size=10, replace=False)
    T = np.random.rand(n)
    T[nan_indices] = np.nan
    return T

import benchit
f = {'rolling': lambda T,m: np.nanstd(rolling_window(T, m), axis=T.ndim),
     'pdroll': pdroll, 'conv':nanstd}
in_={(n,w):(setup(n),w) for n in 10**np.arange(2,6) for w in [5,10,20,50,80,100]}
t = benchit.timings(f, in_, multivar=True)
t.plot(logx=True, sp_ncols=2, save='timings.png', dpi=200)

[Benchmark plot: timings.png]

The NumPy version does well on smaller window sizes, while pandas is better on larger ones.
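
If benchit isn't around, a rough per-call timing for one (array length, window) combination can be had with the standard timeit module (a sketch, using setup and f as defined above) -

import timeit

T = setup(100_000)
m = 50
for name, fn in f.items():
    t_sec = timeit.timeit(lambda: fn(T, m), number=3) / 3   # average of 3 runs
    print(f'{name:8s}: {t_sec:.4f} s per call')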


NumPy vectorized : Explanation of the NumPy nanstd version

Basically, np.nanstd computes the standard deviation while ignoring NaNs. Now, the std can be expressed in terms of the mean.

Thus, for an array a with no NaNs, it would be :

np.sqrt(np.mean((a-np.mean(a))**2))   # (1)

Let's prove it :

In [43]: a = np.arange(1,6).astype(float)

In [44]: np.nanstd(a)
Out[44]: 1.4142135623730951

In [45]: np.sqrt(np.mean((a-np.mean(a))**2))
Out[45]: 1.4142135623730951

Now, let's say, we have a NaN in it :

In [46]: a[2] = np.nan

In [47]: a
Out[47]: array([ 1.,  2., nan,  4.,  5.])

The std with nanstd would be :

In [48]: np.nanstd(a)
Out[48]: 1.5811388300841898

Let's figure out the equivalent one based on (1).

Let's start with (a-np.mean(a))**2.

This one : ?

In [72]: (a-np.mean(a))**2
Out[72]: array([nan, nan, nan, nan, nan])

No!

This one, where m = ~np.isnan(a), a0 = np.where(m, a, 0) and n = m.sum() : ?

In [73]: (a0 - np.sum(a0)/n)**2
Out[73]: array([4., 1., 9., 1., 4.])

No! Because a is :

In [76]: a
Out[76]: array([ 1.,  2., nan,  4.,  5.])

The NaN position contributes a spurious 9 above; we need to zero out the contribution at the NaN position.

This one : ?

In [75]: m*((a0 - np.sum(a0)/n)**2)
Out[75]: array([4., 1., 0., 1., 4.])

Yes!

Then, what's np.mean((a-np.mean(a))**2)? It's the sum of the values in [75] divided by n :

In [77]: np.sum(m*((a0-np.sum(a0)/n)**2))/n
Out[77]: 2.5

Hence, the final std value :

In [78]: np.sqrt(np.sum(m*((a0-np.sum(a0)/n)**2))/n)
Out[78]: 1.5811388300841898

Summarizing :

In [55]: m = ~np.isnan(a)       # (2)
    ...: a0 = np.where(m, a,0)
    ...: n = m.sum()
    ...: out0 = np.sqrt(np.sum(m*((a0-np.sum(a0)/n)**2))/n)

In [56]: out0
Out[56]: 1.5811388300841898

The next part is incorporating the sliding windows, i.e. doing (2) in a sliding fashion. The first two steps remain unchanged.

Hence, it starts off with :

m = ~np.isnan(a)
a0 = np.where(m, a,0)

But the last two change; let's see how.

Let's focus on the final step to compute out0. We have :

m*((a0-np.sum(a0)/n)**2)

Then, we compute the summation :

np.sum(m*((a0-np.sum(a0)/n)**2))

We have (a-b)**2 = a**2 + b**2 - 2*a*b. So, the earlier step becomes

np.sum(m*(a0**2 + (np.sum(a0)/n)**2 - 2*a0*np.sum(a0)/n))

Since a0 is already 0 wherever m is 0, we have m*a0 == a0, so the cross term can be pulled out of the mask. Further re-arranging leads to :

np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - np.sum(2*a0*np.sum(a0)/n)

np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - 2*np.sum(a0*np.sum(a0))/n

np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - (2/n)*np.sum(a0*np.sum(a0)) # (3)
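
To confirm the expansion numerically, here's a small self-contained check on the 5-element example from earlier -

import numpy as np

a = np.array([1., 2., np.nan, 4., 5.])
m = ~np.isnan(a)
a0 = np.where(m, a, 0)
n = m.sum()

lhs = np.sum(m*((a0 - np.sum(a0)/n)**2))                                   # original form
rhs = np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - (2/n)*np.sum(a0*np.sum(a0))  # form (3)
print(lhs, rhs)    # both print 10.0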

Let's focus on the first two parts of the summation.

To make things concrete, let's take a sample case and set up two datasets - one for the complete (sliding) setup and another for a single window of it.

Setup :

#=========================== 1. Complete setup
a = np.arange(1,10).astype(float)
a[[2,5]] = np.nan

W = 5
k = np.ones(W, dtype=int)
m_comp = ~np.isnan(a)
a0_comp = np.where(m_comp, a,0)

n_comp = np.convolve(m_comp,k,'valid')
c1 = np.convolve(a0_comp, k,'valid')
c2 = np.convolve((a0_comp**2)*m_comp,k,'valid')

#=========================== 2. Windowed setup
a1 = np.arange(1,6).astype(float)
a1[2] = np.nan
m = ~np.isnan(a1)
a0 = np.where(m, a1,0)
n = m.sum()
out0 = np.sqrt(np.sum(m*((a0-np.sum(a0)/n)**2))/n)

From the windowed setup, we have :

In [51]: np.sum(m*(a0**2 + (np.sum(a0)/n)**2))
Out[51]: 82.0

In [52]: np.sum(m*(a0**2) + m*((np.sum(a0)/n)**2))
Out[52]: 82.0

In [53]: np.sum(m*(a0**2)) + np.sum(m*((np.sum(a0)/n)**2))
Out[53]: 82.0

First summation part :

In [86]: np.sum(m*(a0**2))
Out[86]: 46.0

# complete setup version :
In [87]: c2
Out[87]: array([ 46.,  45.,  90., 154., 219.])
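
In other words, each entry of c2 is just the windowed sum of the masked squares; the same numbers fall out of the question's rolling_window (a quick check, assuming the complete-setup variables above are in scope) -

# c2 from the convolution equals the per-window sums of (a0_comp**2)*m_comp
per_window = np.sum(rolling_window((a0_comp**2)*m_comp, W), axis=-1)
print(per_window)    # array([ 46.,  45.,  90., 154., 219.]), same as c2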

Second summation part :

In [54]: np.sum(m*((np.sum(a0)/n)**2))
Out[54]: 36.0

# complete setup version :
In [55]: n_comp*(c1/n_comp)**2
Out[55]: 
array([ 36.        ,  40.33333333,  85.33333333, 144.        ,
       210.25      ])

The remaining piece of the puzzle from (3) is :

In [79]: (2/n)*np.sum(a0*np.sum(a0))
Out[79]: 72.0

Let's focus on the meat of it :

In [80]: np.sum(a0*np.sum(a0))
Out[80]: 144.0

On the complete setup, it would correspond to :

In [81]: c1**2
Out[81]: array([144., 121., 256., 576., 841.])

Thus, for the entire remaining piece :

In [82]: (2/n)*np.sum(a0*np.sum(a0))
Out[82]: 72.0

# complete setup version :
In [83]: (2/n_comp)*c1**2
Out[83]: 
array([ 72.        ,  80.66666667, 170.66666667, 288.        ,
       420.5       ])

Hence, (3) and its complete version counterpart would be :

In [89]: np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - (2/n)*np.sum(a0*np.sum(a0))
Out[89]: 10.0

In [90]: c2 + n_comp*(c1/n_comp)**2 - (2/n_comp)*c1**2
Out[90]: array([10.        ,  4.66666667,  4.66666667, 10.        ,  8.75      ])

To get the final std values, we divide by the count of valid entries per window and then apply sqrt :

In [99]: np.sqrt((c2 + n_comp*(c1/n_comp)**2 - (2/n_comp)*c1**2)/n_comp)
Out[99]: array([1.58113883, 1.24721913, 1.24721913, 1.58113883, 1.47901995])

Hence, with some cleanup, we end up with the final nanstd version.
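
As a final check (assuming the complete-setup variables are still around), the assembled expression is exactly what the nanstd function at the top computes -

assembled = np.sqrt((c2 + n_comp*(c1/n_comp)**2 - (2/n_comp)*c1**2)/n_comp)
print(np.allclose(assembled, nanstd(a, W)))    # expected: True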

Upvotes: 1
