Reputation: 31
I've been trying to figure out how to efficiently calculate the covariance in a moving window, i.e. moving from a set of values (x[0], y[0])..(x[n-1], y[n-1]) to a new set of values (x[1], y[1])..(x[n], y[n]). In other words, the value (x[0], y[0]) gets replaces by the value (x[n], y[n]). For performance reasons I need to calculate the covariance incrementally in the sense that I'd like to express the new covariance Cov(x[1]..x[n], y[1]..y[n]) in terms of the previous covariance Cov(x[0]..x[n-1], y[0]..y[n-1]).
Starting off with the naive formula for covariance as described here:
[https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance][1]
All I can come up with is:
Cov(x[1]..x[n], y[1]..y[n]) =
Cov(x[0]..x[n-1], y[0]..y[n-1]) +
(x[n]*y[n] - x[0]*y[0]) / n -
AVG(x[1]..x[n]) * AVG(y[1]..y[n]) +
AVG(x[0]..x[n-1]) * AVG(y[0]..y[n-1])
I'm sorry about the notation, I hope it's more or less clear what I'm trying to express.
However, I'm not sure if this is sufficiently numerically stable. Dealing with large values I might run into arithmetic overflows or other (for example cancellation) issues.
Is there a better way to do this?
Thanks for any help.
Upvotes: 3
Views: 2235
Reputation: 26943
Not sure why no one has mentioned this, but you can use the Welford online algorithm which relies on the running mean:
The equations should look like:
Upvotes: 1
Reputation: 26087
It looks like you are trying some form of "add the new value and subtract the old one". You are correct to worry: this method is not numerically stable. Keeping sums this way is subject to drift, but the real killer is the fact that at each step you are subtracting a large number from another large number to get what is likely a very small number.
One improvement would be to maintain your sums (of x_i
, y_i
, and x_i*y_i
) independently, and recompute the naive formula from them at each step. Your running sums would still drift, and the naive formula is still numerically unstable, but at least you would only have one step of numerical instability.
A stable way to solve this problem would be to implement a formula for (stably) merging statistical sets, and evaluate your overall covariance using a merge tree. Moving your window would update one of your leaves, requiring an update of each node from that leaf to the root. For a window of size n, this method would take O(log n) time per update instead of the O(1) naive computation, but the result would be stable and accurate. Also, if you don't need the statistics for each incremental step, you can update the tree once per each output sample instead of once per input sample. If you have k input samples per output sample, this reduces the cost per input sample to O(1 + (log n)/k).
From the comments: the wikipedia page you reference includes a section on Knuth's online algorithm, which is relatively stable, though still prone to drift. You should be able to do something comparable for covariance; and resetting your computation every K*n samples should limit the drift at minimal cost.
Upvotes: 2