Reputation: 3565
I am trying to implement a parallel calculation of variance in JavaScript using MapReduce. I believe that this Parallel algorithm could be used, but I cannot figure out how to apply it to an arbitrary number of datasets. So far, I have come to the conclusion that the best way to approach the problem is to do the reduction on the sum of squares rather than on the variance itself. A naive implementation would look like this:
// partials is an array of [count, sum, sumsquare] arrays
function variance(partials) {
    var count = 0;
    var sum = 0;
    var sumsquare = 0;
    for (var i = 0; i < partials.length; ++i) {
        count += partials[i][0];
        sum += partials[i][1];
        sumsquare += partials[i][2];
    }
    return (sumsquare / count) - Math.pow(sum / count, 2);
}
// variance([[3, 6, 14], [3, 15, 77], [3, 24, 194]]) should return 6.666666666666668
Not being a statistician, I have a hard time figuring out whether such a parallel algorithm would introduce too many compounding errors. But if it is acceptable, it is worth noting that the variance does not need to be computed during the map phase: only the count, sum, and sum of squares are needed.
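If compounding error turns out to be a concern, the parallel algorithm referenced above can also be applied directly by reducing over `{ count, mean, M2 }` triples instead of raw sums, where `M2` is the sum of squared deviations from the mean. The sketch below assumes partials still arrive in the `[count, sum, sumsquare]` format, and `toStats` is a hypothetical adapter for that format:

```javascript
// Pairwise merge of two { count, mean, M2 } summaries, following the
// combining step of the parallel variance algorithm. This formulation
// is generally more numerically stable than summing raw squares.
function mergeStats(a, b) {
    var count = a.count + b.count;
    var delta = b.mean - a.mean;
    var mean = a.mean + delta * b.count / count;
    var M2 = a.M2 + b.M2 + delta * delta * a.count * b.count / count;
    return { count: count, mean: mean, M2: M2 };
}

// Hypothetical adapter from the [count, sum, sumsquare] format above.
function toStats(partial) {
    var count = partial[0], sum = partial[1], sumsquare = partial[2];
    return { count: count, mean: sum / count, M2: sumsquare - sum * sum / count };
}

function varianceStable(partials) {
    var total = partials.map(toStats).reduce(mergeStats);
    return total.M2 / total.count; // population variance
}
```

On the question's example, `varianceStable([[3, 6, 14], [3, 15, 77], [3, 24, 194]])` produces the same 6.666... result as the naive sum-of-squares approach.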
Upvotes: 2
Views: 893
Reputation: 122936
I'm not sure I clearly understand what you mean by "The reduce function will get an array of quadruplets like { variance, sumsquare, sum, count } for each subset of the entire dataset that was mapped onto a set of workers." Still, based on your code snippet, I'd use something like:
Array.sums = function (arr, addarr) {
    var newarr = [0, 0, 0];
    if (addarr.length === arr.length) {
        arr.forEach(function (v, i) {
            newarr[i] = v + addarr[i];
        });
    }
    return newarr;
};
function variance(arr) {
    var summations = arr[0].map(function () { return 0; });
    arr.forEach(function (v) {
        summations = Array.sums(v, summations);
    });
    summations.unshift((summations[2] / summations[0]) -
                       Math.pow(summations[1] / summations[0], 2));
    // summations is now a quadruplet containing [variance, count, sum, sumsquare]
    return summations;
}

alert( variance([[3, 6, 14], [3, 15, 77], [3, 24, 194]])[0] );
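For what it's worth, the same component-wise summation can be expressed with `Array.prototype.reduce`, avoiding the need to attach a `sums` helper to `Array`. This is a stylistic sketch, not a correction:

```javascript
// Equivalent aggregation using Array.prototype.reduce instead of a
// custom Array.sums helper. Returns [variance, count, sum, sumsquare].
function variance(partials) {
    var totals = partials.reduce(function (acc, p) {
        return [acc[0] + p[0], acc[1] + p[1], acc[2] + p[2]];
    }, [0, 0, 0]);
    var count = totals[0], sum = totals[1], sumsquare = totals[2];
    totals.unshift((sumsquare / count) - Math.pow(sum / count, 2));
    return totals;
}
```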
Upvotes: 1
Reputation: 3565
As far as I can tell, the "naive" solution that was added to the original question is as good as it gets. It relies on the three aggregations (count, sum, and sumsquare) that a one-pass variance calculation would require anyway, and all it does is sum the individual aggregations, which a single-pass calculation would also need. It therefore adds no arithmetic overhead and, as a consequence, should not introduce any errors compared to a centralized calculation.
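As a quick sanity check, the sketch below compares the partial-aggregation approach against a single centralized pass. It assumes, purely for illustration, that the question's partials came from the chunks [1, 2, 3], [4, 5, 6], and [7, 8, 9], which reproduce the same [count, sum, sumsquare] triples:

```javascript
// Aggregate a chunk of raw values into [count, sum, sumsquare].
function aggregate(values) {
    var count = 0, sum = 0, sumsquare = 0;
    for (var i = 0; i < values.length; ++i) {
        count += 1;
        sum += values[i];
        sumsquare += values[i] * values[i];
    }
    return [count, sum, sumsquare];
}

function varianceFrom(agg) {
    return (agg[2] / agg[0]) - Math.pow(agg[1] / agg[0], 2);
}

var chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]];

// Distributed: aggregate each chunk, then sum the aggregates,
// as in the question's "naive" solution.
var combined = chunks.map(aggregate).reduce(function (a, b) {
    return [a[0] + b[0], a[1] + b[1], a[2] + b[2]];
});

// Centralized: a single pass over all the values.
var central = aggregate([].concat.apply([], chunks));

// With these small integer inputs every addition is exact, so the two
// aggregates (and hence the two variances) coincide.
```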
Upvotes: 0