Canovice

Reputation: 10441

Calculate how many std deviations the values of certain keys are from the mean

I am working in JavaScript / React with an array of objects containing sports data.

Here is an example of the data I am working with:

const mydata = [
  { name: "Tom", year: 2018, statA: 23.2, statB: 12.3 },
  { name: "Bob", year: 2018, statA: 13.2, statB: 10.1 },
  { name: "Joe", year: 2018, statA: 18.2, statB: 19.3 },
  { name: "Tim", year: 2018, statA: 21.1, statB: 21.3 },
  { name: "Jim", year: 2018, statA: 12.5, statB: 32.4 },
  { name: "Nik", year: 2017, statA: 23.6, statB: 23.8 },
  { name: "Tre", year: 2017, statA: 37.8, statB: 18.3 },
  { name: "Ton", year: 2017, statA: 15.3, statB: 12.1 },
  { name: "Bil", year: 2017, statA: 32.2, statB: 41.3 },
  { name: "Geo", year: 2017, statA: 21.5, statB: 39.8 }
];

My data manipulation problem here feels very challenging, and I am struggling. I need to scale (to mean 0, stdev 1), by year, each of several keys in my data (statA, statB).

For example, looking at the values for year === 2018 in the statA column, we have [23.2, 13.2, 18.2, 21.1, 12.5]. As a test, plugging this vector into R's scale() function gives the following:

scale(c(23.2, 13.2, 18.2, 21.1, 12.5))

           [,1]
[1,]  1.1765253
[2,] -0.9395274
[3,]  0.1184989
[4,]  0.7321542
[5,] -1.0876511
attr(,"scaled:center")
[1] 17.64
attr(,"scaled:scale")
[1] 4.72578 

... so in my original array of objects, the value statA: 23.2 in the first object should be updated to 1.1765, since the value 23.2 is 1.1765 standard deviations above the mean of all statA values where year === 2018. In my full dataset, I have ~8K objects and ~50 keys in each object, ~40 of which I need to scale by year.

At a high level, I think I have to (1st) compute the mean and st dev for each stat for each year, and (2nd) use the mean and st dev for that stat for that year, and map it to its scaled value. Performance/speed is important for my app and I'm worried that an ordinary for loop will be very slow, although that's what I'm attempting currently.

Any help with this is appreciated!

EDIT 2:

Before I read through the answers and code things up on my end, I wanted to post what I had ended up with yesterday:

    const scaleCols = ['statA', 'statB'];
    const allYears = [...new Set(rawData.map(ps => ps.Year))];

    // loop over each year of the data
    for(var i = 0; i < allYears.length; i++) {

        // compute sums and counts (for mean calc)
        const thisYearsArray = rawData.filter(d => d.Year === allYears[i]);
        const sums = {}, counts = {};
        for(var j = 0; j < thisYearsArray.length; j++) {
            for(var k = 0; k < scaleCols.length; k++) {
                if(!(scaleCols[k] in sums)) {
                    sums[scaleCols[k]] = 0;
                    counts[scaleCols[k]] = 0;
                }

                sums[scaleCols[k]] += thisYearsArray[j][scaleCols[k]];
                counts[scaleCols[k]] += 1;
            }
        }

        console.log('sums', sums)
        console.log('counts', counts)
    }

... like I said, not very good.

Edit: Would using d3's scale functions help with this?

Upvotes: 1

Views: 694

Answers (3)

altocumulus

Reputation: 21578

Although I consider myself an admirer of d3, I think adding the tag to this question was more of a red herring. The other two answers are perfectly fine in that they yield the correct results, but will fall behind when it comes to performance. Since this was a major aspect of your question I would like to add my own two cents to this. I think it might be helpful to implement the calculations yourself sticking to Vanilla-JS.

Looking at the implementation of d3.deviation() one notices that it is just a thin wrapper around d3.variance() calculating the square root of the variance. Examining the implementation of the latter brings two things to mind:

  1. The code employs a safeguard to protect against undefined and NaN values:

    This method ignores undefined and NaN values; this is useful for ignoring missing data.

    If you can be sure there are no missing values in your data you can safely get rid of these expensive checks.

  2. While calculating the variance the mean is calculated as a side-effect:

    delta = value - mean;
    mean += delta / ++m;
    sum += delta * (value - mean);
    

    You can use this to return both the variance as well as the mean after a single loop through your data.

Furthermore, d3.mean() also uses the same safeguard against NaN or undefined values as d3.variance(). Calling both methods sequentially does, of course, mean that these checks will also be executed twice for each value.

Borrowing from d3's own implementation a solution to this can be implemented along the following lines:

function meanAndDeviation(values) {
  const len = values.length;
  let i = 0;
  let value;
  let delta; // declared explicitly so the function also works in strict mode
  let mean = 0;
  let sum = 0;
  while (i<len) {
    delta = (value = values[i]) - mean;
    mean += delta / ++i;
    sum += delta * (value - mean);
  }

  return { mean, deviation: Math.sqrt(sum / (i - 1))};
}

Have a look at the following demo:

function meanAndDeviation(values) {
  const len = values.length;
  let i = 0;
  let value;
  let delta; // declared explicitly so the function also works in strict mode
  let mean = 0;
  let sum = 0;
  while (i<len) {
    delta = (value = values[i]) - mean;
    mean += delta / ++i;
    sum += delta * (value - mean);
  }
  
  return { mean, deviation: Math.sqrt(sum / (i - 1))};
}

const arr = [23.2, 13.2, 18.2, 21.1, 12.5];
const {mean, deviation} = meanAndDeviation(arr);

const result = arr.map(d => (d - mean) / deviation);

console.log(result);

Agreed, the destructuring of the returned object is not the most performant part of the code but since it is called only once I like it for its readability.

Upvotes: 1

Gerardo Furtado

Reputation: 102198

As a D3 programmer I'm glad to see the other answer using a D3 scale (especially because the question was not originally tagged as a D3 question). However, as the answerer already hinted, you don't need a D3 scale here; it is overkill.

All you need is (value - mean) / deviation:

var result = arr.map(d => (d - mean) / deviation);

Here is the demo:

var arr = [23.2, 13.2, 18.2, 21.1, 12.5];
var deviation = d3.deviation(arr)
var mean = d3.mean(arr)

var result = arr.map(d => (d - mean) / deviation);

console.log(result)
<script src="https://d3js.org/d3.v5.min.js"></script>

Besides that, two considerations:

  1. "At a high level, I think I have to (1st) compute the mean and std dev for each stat for each year, and (2nd) use the mean and std dev for that stat for that year": That's correct. You cannot calculate how many standard deviations a value is from the mean before knowing the standard deviation and the mean, and you can only know those after looping over the whole array once. Therefore, you cannot do what you want by iterating over the data array fewer than two times.
  2. "Performance/speed is important for my app and I'm worried that an ordinary for loop will be very slow": Things are a bit different now, but until recently nothing would beat a for loop regarding performance. So, what you call an ordinary loop is normally the fastest solution.
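For the array of objects in the question, those two passes might look roughly like the sketch below. The function name scaleByYear and its parameters are hypothetical (they come from neither the question nor D3), and the sketch assumes every row has a numeric value for each listed key and that each year has at least two rows. It keeps a running mean and sum of squared deviations per year and key, so no per-year filtering is needed.

```javascript
// Hypothetical helper: scales the given keys to mean 0 and sample standard
// deviation 1 within each group (here: each year). Returns new row objects.
function scaleByYear(rows, keys, groupKey = "year") {
  // Pass 1: per group, keep a running mean and a running sum of squared
  // deviations (m2) for every key.
  const stats = new Map();
  for (let i = 0; i < rows.length; i++) {
    const row = rows[i];
    let group = stats.get(row[groupKey]);
    if (group === undefined) {
      group = { n: 0, mean: keys.map(() => 0), m2: keys.map(() => 0) };
      stats.set(row[groupKey], group);
    }
    group.n++;
    for (let k = 0; k < keys.length; k++) {
      const value = row[keys[k]];
      const delta = value - group.mean[k];
      group.mean[k] += delta / group.n;
      group.m2[k] += delta * (value - group.mean[k]);
    }
  }

  // Turn the sums of squared deviations into sample standard deviations.
  for (const group of stats.values()) {
    group.sd = group.m2.map(m2 => Math.sqrt(m2 / (group.n - 1)));
  }

  // Pass 2: copy each row and replace the selected keys with their z-scores.
  const scaled = new Array(rows.length);
  for (let i = 0; i < rows.length; i++) {
    const row = rows[i];
    const group = stats.get(row[groupKey]);
    const copy = { ...row };
    for (let k = 0; k < keys.length; k++) {
      copy[keys[k]] = (row[keys[k]] - group.mean[k]) / group.sd[k];
    }
    scaled[i] = copy;
  }
  return scaled;
}

// Sample data copied from the question.
const mydata = [
  { name: "Tom", year: 2018, statA: 23.2, statB: 12.3 },
  { name: "Bob", year: 2018, statA: 13.2, statB: 10.1 },
  { name: "Joe", year: 2018, statA: 18.2, statB: 19.3 },
  { name: "Tim", year: 2018, statA: 21.1, statB: 21.3 },
  { name: "Jim", year: 2018, statA: 12.5, statB: 32.4 },
  { name: "Nik", year: 2017, statA: 23.6, statB: 23.8 },
  { name: "Tre", year: 2017, statA: 37.8, statB: 18.3 },
  { name: "Ton", year: 2017, statA: 15.3, statB: 12.1 },
  { name: "Bil", year: 2017, statA: 32.2, statB: 41.3 },
  { name: "Geo", year: 2017, statA: 21.5, statB: 39.8 }
];

const scaled = scaleByYear(mydata, ["statA", "statB"]);
console.log(scaled[0].statA); // roughly 1.1765, as in the R output in the question
```

The original rows are left untouched here; if allocating ~8K new objects turns out to matter, writing the scaled values back into the existing rows in the second pass would be a straightforward variation.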

Upvotes: 3

barbsan

Reputation: 3458

You can achieve the same result as R's scale() by creating one of d3's continuous scales. See the snippet below.

var arr = [23.2, 13.2, 18.2, 21.1, 12.5];
var deviation = d3.deviation(arr)
var mean = d3.mean(arr)

var scale = d3.scaleLinear()
   .domain([mean-deviation, mean+deviation])
   .range([-1, 1]);
   
var result = arr.map(el => scale(el));

console.log(result)
   <script src="https://d3js.org/d3.v5.min.js"></script>
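
Why this matches the plain formula: a linear map from [mean - deviation, mean + deviation] onto [-1, 1] sends x to -1 + 2 * (x - (mean - deviation)) / (2 * deviation), which simplifies to (x - mean) / deviation. This relies on the scale not clamping its output, which is the default for d3.scaleLinear(). A small d3-free check of that algebra, where linearMap is a hypothetical stand-in for the scale and not part of d3:

```javascript
// Hypothetical stand-in for an unclamped linear scale: maps the interval
// [d0, d1] onto [r0, r1] by straight-line interpolation (and extrapolation).
function linearMap(x, d0, d1, r0, r1) {
  return r0 + ((x - d0) / (d1 - d0)) * (r1 - r0);
}

const arr = [23.2, 13.2, 18.2, 21.1, 12.5];

// Mean and sample standard deviation (divisor n - 1), computed by hand.
const mean = arr.reduce((sum, v) => sum + v, 0) / arr.length;
const deviation = Math.sqrt(
  arr.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (arr.length - 1)
);

// Same values two ways: through the linear map and through the z-score formula.
const viaScale = arr.map(v => linearMap(v, mean - deviation, mean + deviation, -1, 1));
const viaFormula = arr.map(v => (v - mean) / deviation);

console.log(viaScale);
console.log(viaFormula);
```

Both arrays should agree up to floating-point rounding, so the scale adds convenience but no extra information over the formula.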

Upvotes: 2
