Reputation: 3565
I am trying to implement a parallel calculation of variance in JavaScript using MapReduce. I believe that this Parallel algorithm could be used, but I cannot figure out how to apply it to an arbitrary number of datasets. So far, I have come to the conclusion that the best way to approach the problem is to do the reduction on the sum of squares rather than on the variance itself. A naive implementation would look like this:
// partials is an array of [count, sum, sumsquare] arrays
function variance(partials) {
    var count = 0;
    var sum = 0;
    var sumsquare = 0;
    for (var i = 0; i < partials.length; ++i) {
        count += partials[i][0];
        sum += partials[i][1];
        sumsquare += partials[i][2];
    }
    return (sumsquare / count) - Math.pow(sum / count, 2);
}
// variance([[3, 6, 14], [3, 15, 77], [3, 24, 194]]) should return 6.666666666666668
Not being a statistician, I have a hard time figuring out whether such a parallel algorithm would introduce too many compounding errors. But if it is acceptable, it is worth noting that the variance does not need to be computed during the map phase: only the count, sum, and sum of squares are needed.
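If compounding error turns out to be a concern, the parallel algorithm referenced above can also be applied directly by reducing over `{ count, mean, M2 }` triples instead of raw sums, where `M2` is the sum of squared deviations from the mean. The sketch below assumes partials still arrive in the `[count, sum, sumsquare]` format, and `toStats` is a hypothetical adapter for that format:

```javascript
// Pairwise merge of two { count, mean, M2 } summaries, following the
// combining step of the parallel variance algorithm. This formulation
// is generally more numerically stable than summing raw squares.
function mergeStats(a, b) {
    var count = a.count + b.count;
    var delta = b.mean - a.mean;
    var mean = a.mean + delta * b.count / count;
    var M2 = a.M2 + b.M2 + delta * delta * a.count * b.count / count;
    return { count: count, mean: mean, M2: M2 };
}

// Hypothetical adapter from the [count, sum, sumsquare] format above.
function toStats(partial) {
    var count = partial[0], sum = partial[1], sumsquare = partial[2];
    return { count: count, mean: sum / count, M2: sumsquare - sum * sum / count };
}

function varianceStable(partials) {
    var total = partials.map(toStats).reduce(mergeStats);
    return total.M2 / total.count; // population variance
}
```

On the question's example, `varianceStable([[3, 6, 14], [3, 15, 77], [3, 24, 194]])` produces the same 6.666... result as the naive sum-of-squares approach.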
Upvotes: 2
Views: 893
Reputation: 122936
I'm not sure I clearly understand what you mean by "The reduce function will get an array of quadruplets like { variance, sumsquare, sum, count } for each subset of the entire dataset that was mapped onto a set of workers." Still, based on your code snippet, I'd use something like:
Array.sums = function (arr, addarr) {
    var newarr = [0, 0, 0];
    if (addarr.length === arr.length) {
        arr.forEach(function (v, i) {
            newarr[i] = v + addarr[i];
        });
    }
    return newarr;
};
function variance(arr) {
    var summations = arr[0].map(function () { return 0; });
    arr.forEach(function (v) {
        summations = Array.sums(v, summations);
    });
    summations.unshift((summations[2] / summations[0]) -
                       Math.pow(summations[1] / summations[0], 2));
    // summations is now a quadruplet containing [variance, count, sum, sumsquare]
    return summations;
}

alert( variance([[3, 6, 14], [3, 15, 77], [3, 24, 194]])[0] );
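For what it's worth, the same component-wise summation can be expressed with `Array.prototype.reduce`, avoiding the need to attach a `sums` helper to `Array`. This is a stylistic sketch, not a correction:

```javascript
// Equivalent aggregation using Array.prototype.reduce instead of a
// custom Array.sums helper. Returns [variance, count, sum, sumsquare].
function variance(partials) {
    var totals = partials.reduce(function (acc, p) {
        return [acc[0] + p[0], acc[1] + p[1], acc[2] + p[2]];
    }, [0, 0, 0]);
    var count = totals[0], sum = totals[1], sumsquare = totals[2];
    totals.unshift((sumsquare / count) - Math.pow(sum / count, 2));
    return totals;
}
```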
Upvotes: 1
Reputation: 3565
As far as I can tell, the "naive" solution that was added to the original question is as good as it gets. It relies on the three aggregations (count, sum, and sumsquare) that a one-pass variance calculation would require anyway, and all it does is sum the individual aggregations, which a single-pass calculation would also need. It therefore adds no arithmetic overhead and, as a consequence, should not introduce any errors compared to a centralized calculation.
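As a quick sanity check, the sketch below compares the partial-aggregation approach against a single centralized pass. It assumes, purely for illustration, that the question's partials came from the chunks [1, 2, 3], [4, 5, 6], and [7, 8, 9], which reproduce the same [count, sum, sumsquare] triples:

```javascript
// Aggregate a chunk of raw values into [count, sum, sumsquare].
function aggregate(values) {
    var count = 0, sum = 0, sumsquare = 0;
    for (var i = 0; i < values.length; ++i) {
        count += 1;
        sum += values[i];
        sumsquare += values[i] * values[i];
    }
    return [count, sum, sumsquare];
}

function varianceFrom(agg) {
    return (agg[2] / agg[0]) - Math.pow(agg[1] / agg[0], 2);
}

var chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]];

// Distributed: aggregate each chunk, then sum the aggregates,
// as in the question's "naive" solution.
var combined = chunks.map(aggregate).reduce(function (a, b) {
    return [a[0] + b[0], a[1] + b[1], a[2] + b[2]];
});

// Centralized: a single pass over all the values.
var central = aggregate([].concat.apply([], chunks));

// With these small integer inputs every addition is exact, so the two
// aggregates (and hence the two variances) coincide.
```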
Upvotes: 0