ajsc

Reputation: 11

Pairwise differences between rolling average values for different window widths

This code aims to compute the rolling average of a signal for a range of window widths (i.e. how many points are averaged), and then calculate the sum of all squared pairwise differences between the averages at each window width.

import numpy as np

testData = np.random.normal(0,1,10000)

windowWidths = np.arange(1,1000)
sliding_averages = []
diffs = []

for windowWidth in windowWidths:
    # Rolling average using convolution
    sliding_average = np.convolve(testData, np.ones(windowWidth) / windowWidth, mode='valid') 
    
    # All pairwise differences
    pairwiseDiffs = (sliding_average[:,None] - sliding_average[None,:]) 

    # Define mask to extract only one corner of difference matrix
    mask = np.triu(np.ones_like(pairwiseDiffs, dtype=bool), k=1) 
    pairwiseDiffs = pairwiseDiffs[mask]

    pairwiseDiffsSqrd = pairwiseDiffs**2
    diffs.append(np.sum(pairwiseDiffsSqrd))
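For reference, the masked upper-triangle step can be sanity-checked against an explicit loop over pairs. This is a small sketch with arbitrary test data, not part of the original code:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 60)

# Vectorized: upper triangle of the pairwise-difference matrix
pairwise = a[:, None] - a[None, :]
mask = np.triu(np.ones_like(pairwise, dtype=bool), k=1)
vec_sum = np.sum(pairwise[mask] ** 2)

# Brute force: explicit loop over pairs i < j
loop_sum = sum((a[i] - a[j]) ** 2
               for i in range(len(a)) for j in range(i + 1, len(a)))

print(np.isclose(vec_sum, loop_sum))
```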

This is intended to be a component in reproducing the computation described in this section of a paper:

[Image: the paper's "Calculation" section, not reproduced here]

My question is whether there is a more efficient way to run this calculation.

I have tried vectorizing further and replacing convolution steps with other approaches, but am not confident in the result.

Upvotes: 1

Views: 69

Answers (2)

ajsc

Reputation: 11

After I realised my mistake, this was the code I came up with. There are Python libraries that achieve the same thing (e.g. AllanTools), but I wanted to make sure it was doing what I expected and to develop my understanding.

import numpy as np

def allanVariance(data):

    windowWidths = np.arange(1, len(data) // 2, 2)  # Step of 2 reduces sampling to speed things up
    diffs = []

    for windowWidth in windowWidths:
        # Rolling average at each window width
        sliding_average = np.convolve(data, np.ones(windowWidth) / windowWidth,
                                      mode='valid')

        # Differences between averages spaced one window width apart
        consecDiffs = sliding_average[windowWidth:] - sliding_average[:-windowWidth]

        diffs.append(np.sum(consecDiffs**2))

    diffs = np.array(diffs)
    N = len(data)

    # Allan variance calculation (as expressed in Gattinger et al.,
    # Optics Express, Vol. 30, Issue 4, pp. 5222-5254, among others)
    allanVar = np.sqrt((1 / (2 * (N - 2 * windowWidths - 1))) * diffs)

    return allanVar
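The overlapping-difference step above can be checked against an explicit loop for a single window width. A small self-contained sketch with arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 500)
m = 7  # one example window width

# Rolling average via convolution, as in the function above
avg = np.convolve(data, np.ones(m) / m, mode='valid')

# Vectorized differences between averages m samples apart
vec = np.sum((avg[m:] - avg[:-m]) ** 2)

# The same differences written as an explicit loop
loop = sum((avg[i + m] - avg[i]) ** 2 for i in range(len(avg) - m))

print(np.isclose(vec, loop))
```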

Upvotes: 0

Mantas Kandratavičius

Reputation: 1508

While I can't fix any further bugs that you might have found during profiling, I can answer your original question regarding the performance issues.


Looking at your code, it seems you're interested in the sum of squared differences, which is closely related to the mathematical definition of variance. If we rearrange the terms of the variance calculation, we arrive at a formula where the squares are computed first and the subtraction happens only afterwards:

sum_{i<j} (a_i - a_j)^2 = n * sum(a_i^2) - (sum(a_i))^2
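This identity (squares first, one subtraction at the end) can be verified numerically on arbitrary data; the example below is a sketch, not taken from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 200)
n = len(a)

# Left-hand side: explicit sum over all pairs i < j
lhs = sum((a[i] - a[j]) ** 2 for i in range(n) for j in range(i + 1, n))

# Right-hand side: squares first, then a single subtraction
rhs = n * np.sum(a ** 2) - np.sum(a) ** 2

print(np.isclose(lhs, rhs))
```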

Now looking back at your code:

import numpy as np

testData = np.random.normal(0, 1, 10000)

windowWidths = np.arange(1, 1000)
sliding_averages = []
diffs = []

for windowWidth in windowWidths:
    # Rolling average using convolution
    sliding_average = np.convolve(testData, np.ones(windowWidth) / windowWidth,
                                  mode='valid')

    # # All pairwise differences
    # pairwiseDiffs = (sliding_average[:, None] - sliding_average[None, :])
    #
    # # Define mask to extract only one corner of difference matrix
    # mask = np.triu(np.ones_like(pairwiseDiffs, dtype=bool), k=1)
    # pairwiseDiffs = pairwiseDiffs[mask]
    #
    # pairwiseDiffsSqrd = pairwiseDiffs ** 2
    # diffs.append(np.sum(pairwiseDiffsSqrd))

The commented-out part should give the same result whether we calculate the differences first or the squares first:

sum_sq = np.sum(sliding_average ** 2)
sum_ = np.sum(sliding_average)
sum_sq_diff = sum_sq - sum_ * sum_
diffs.append(sum_sq_diff)

However, we cannot forget that the formula carries a factor of the number of observations multiplying the sum of squares, and that is still completely missing from our code, so we add it back in to get:

import numpy as np

np.random.seed(0)
testData = np.random.normal(0, 1, 10000)
windowWidths = np.arange(1, 1000)

sliding_averages = []
diffs = []

for windowWidth in windowWidths:
    # Rolling average using convolution
    sliding_average = np.convolve(testData, np.ones(windowWidth) / windowWidth,
                                  mode='valid')

    sum_sq = np.sum(sliding_average ** 2)
    sum_ = np.sum(sliding_average)
    sum_sq_diff = len(sliding_average) * sum_sq - sum_ * sum_
    diffs.append(sum_sq_diff)

As expected, with a fixed seed this returns the same diffs list, with sub-second execution time on my machine. Avoiding masks and full pairwise matrices is often possible when working with statistical calculations and is almost always the fastest way to compute the result.
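If even more speed is needed, one further option (an alternative sketch, not part of the answer above) is to replace each convolution with a cumulative sum, which yields every rolling mean from a single O(n) pass:

```python
import numpy as np

def rolling_mean_cumsum(data, m):
    # Rolling mean of width m computed from one cumulative-sum pass;
    # equivalent to np.convolve(data, np.ones(m)/m, mode='valid')
    c = np.concatenate(([0.0], np.cumsum(data)))
    return (c[m:] - c[:-m]) / m

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

ref = np.convolve(data, np.ones(25) / 25, mode='valid')
fast = rolling_mean_cumsum(data, 25)
print(np.allclose(ref, fast))
```

Note that the cumulative-sum form can lose a little floating-point precision on very long signals, which is why the comparison uses `np.allclose` rather than exact equality.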

Upvotes: 0
