Reputation: 11
This code aims to compute the rolling average of a signal for a range of window widths (i.e. how many points are averaged), and then calculate the sum of all pairwise differences between averages at each window width.
import numpy as np
testData = np.random.normal(0,1,10000)
windowWidths = np.arange(1,1000)
sliding_averages = []
diffs = []
for windowWidth in windowWidths:
# Rolling average using convolution
sliding_average = np.convolve(testData, np.ones(windowWidth) / windowWidth, mode='valid')
# All pairwise differences
pairwiseDiffs = (sliding_average[:,None] - sliding_average[None,:])
# Define mask to extract only one corner of difference matrix
mask = np.triu(np.ones_like(pairwiseDiffs, dtype=bool), k=1)
pairwiseDiffs = pairwiseDiffs[mask]
pairwiseDiffsSqrd = pairwiseDiffs**2
diffs.append(np.sum(pairwiseDiffsSqrd))
This is aiming to be a component in reproducing the computation described in this section of a paper:
Calculation:
My question is whether there is a more efficient way to run this calculation.
I have tried vectorizing further and replacing convolution steps with other approaches, but am not confident in the result.
Upvotes: 1
Views: 69
Reputation: 11
After I realised my mistake, this was the code I came up with. There are python libraries that achieve the same thing (e.g. AllanTools) but I wanted to make sure it was doing what I expected and develop my understanding.
def allanVariance(data):
windowWidths = np.arange(1,int(len(data)/2), 2) #Can reduce sampling rate to speed up.
diffs = []
for windowWidth in windowWidths:
# Rolling average at each window width
sliding_average = np.convolve(data, np.ones(windowWidth) / windowWidth, mode='valid')
consecDiffs = sliding_average[windowWidth:] - sliding_average[:-windowWidth]
consecDiffs = np.array(consecDiffs)
DiffsSqrd = consecDiffs**2
diffs.append(np.sum(DiffsSqrd))
diffs = (np.array(diffs))
N = len(data)
# Allan variance calculation (as expressed in Gattinger et al. Optics Express, Vol. 30, Issue 4, pp. 5222-5254, among others)
allnVar = np.sqrt((1/(2*(N-(2*windowWidths)-1))) * diffs)
return allnVar
Upvotes: 0
Reputation: 1508
While I can't fix any further bugs that you might have found during profiling, I can answer your original question regarding the performance issues.
Looking at your code it seems like you're interested in the squared differences which is coincidentally similar to the mathematical definition of variance. If we do shift around the variables in how variance is being calculated we can achieve a formula where the squares are calculated first and only then the difference:
Now looking back at your code:
import numpy as np
testData = np.random.normal(0, 1, 10000)
windowWidths = np.arange(1, 1000)
sliding_averages = []
diffs = []
for windowWidth in windowWidths:
# Rolling average using convolution
sliding_average = np.convolve(testData, np.ones(windowWidth) / windowWidth,
mode='valid')
# # All pairwise differences
# pairwiseDiffs = (sliding_average[:, None] - sliding_average[None, :])
#
# # Define mask to extract only one corner of difference matrix
# mask = np.triu(np.ones_like(pairwiseDiffs, dtype=bool), k=1)
# pairwiseDiffs = pairwiseDiffs[mask]
#
# pairwiseDiffsSqrd = pairwiseDiffs ** 2
# diffs.append(np.sum(pairwiseDiffsSqrd))
The commented out part should behave the same way no matter whether we calculate the differences first or the square first:
sum_sq = np.sum(sliding_average ** 2)
sum_ = np.sum(sliding_average)
sum_sq_diff = sum_sq - sum_ * sum_
diffs.append(sum_sq_diff)
However we cannot forget that population variance formula divides the sum by the number of observations and that is still completely missing from our code so we add that back in to get:
import numpy as np
np.random.seed(0)
testData = np.random.normal(0, 1, 10000)
windowWidths = np.arange(1, 1000)
sliding_averages = []
diffs = []
for windowWidth in windowWidths:
# Rolling average using convolution
sliding_average = np.convolve(testData, np.ones(windowWidth) / windowWidth,
mode='valid')
sum_sq = np.sum(sliding_average ** 2)
sum_ = np.sum(sliding_average)
sum_sq_diff = len(sliding_average) * sum_sq - sum_ * sum_
diffs.append(sum_sq_diff)
As expected, using set seed, this returns the same diffs list with sub-second execution time on my machine. Avoiding masking or matrix operations altogether is often possible when working with statistical calculations and will almost always be the fastest possible way to compute the result.
Upvotes: 0