Reputation: 10441
I am working in JavaScript / React with an array of objects containing sports data. Here is an example of the data I am working with:
const mydata = [
{ name: "Tom", year: 2018, statA: 23.2, statB: 12.3 },
{ name: "Bob", year: 2018, statA: 13.2, statB: 10.1 },
{ name: "Joe", year: 2018, statA: 18.2, statB: 19.3 },
{ name: "Tim", year: 2018, statA: 21.1, statB: 21.3 },
{ name: "Jim", year: 2018, statA: 12.5, statB: 32.4 },
{ name: "Nik", year: 2017, statA: 23.6, statB: 23.8 },
{ name: "Tre", year: 2017, statA: 37.8, statB: 18.3 },
{ name: "Ton", year: 2017, statA: 15.3, statB: 12.1 },
{ name: "Bil", year: 2017, statA: 32.2, statB: 41.3 },
{ name: "Geo", year: 2017, statA: 21.5, statB: 39.8 }
];
My data manipulation problem feels very challenging, and I am struggling. I need to scale (to mean 0, standard deviation 1), by year, each of several keys in my data (statA, statB).
For example, looking at the values for year === 2018 in the statA column, we have [23.2, 13.2, 18.2, 21.1, 12.5]. As a test, plugging this vector into R's scale() function gives the following:
scale(c(23.2, 13.2, 18.2, 21.1, 12.5))
[,1]
[1,] 1.1765253
[2,] -0.9395274
[3,] 0.1184989
[4,] 0.7321542
[5,] -1.0876511
attr(,"scaled:center")
[1] 17.64
attr(,"scaled:scale")
[1] 4.72578
... so in my original array of objects, the value statA: 23.2 in the first object should be updated to 1.1765, since 23.2 is 1.1765 standard deviations above the mean of all statA values where year === 2018. In my full dataset, I have ~8K objects and ~50 keys in each object, ~40 of which I need to scale by year.
At a high level, I think I have to (1st) compute the mean and standard deviation of each stat for each year, and (2nd) use that mean and standard deviation to map each value of that stat in that year to its scaled value. Performance/speed is important for my app and I'm worried that an ordinary for loop will be very slow, although that's what I'm attempting currently.
Any help with this is appreciated!
EDIT 2:
Before I read through the answers and code this up on my end, I wanted to post what I had finished with yesterday:
const scaleCols = ['statA', 'statB'];
const allYears = [...new Set(rawData.map(ps => ps.Year))];
// loop over each year of the data
for (let i = 0; i < allYears.length; i++) {
  // compute sums and counts (for mean calc)
  const thisYearsArray = rawData.filter(d => d.Year === allYears[i]);
  const sums = {}, counts = {};
  for (let j = 0; j < thisYearsArray.length; j++) {
    for (let k = 0; k < scaleCols.length; k++) {
      // initialize the accumulators the first time a column is seen
      if (!(scaleCols[k] in sums)) {
        sums[scaleCols[k]] = 0;
        counts[scaleCols[k]] = 0;
      }
      sums[scaleCols[k]] += thisYearsArray[j][scaleCols[k]];
      counts[scaleCols[k]] += 1;
    }
  }
  console.log('sums', sums);
  console.log('counts', counts);
}
... like I said, not very good.
Edit: Would using d3's scale functions help with this?
Upvotes: 1
Views: 694
Reputation: 21578
Although I consider myself an admirer of d3, I think adding the d3 tag to this question was more of a red herring. The other two answers are perfectly fine in that they yield the correct results, but will fall behind when it comes to performance. Since this was a major aspect of your question I would like to add my own two cents to this. I think it might be helpful to implement the calculations yourself sticking to Vanilla-JS.
Looking at the implementation of d3.deviation() one notices that it is just a thin wrapper around d3.variance(), calculating the square root of the variance. Examining the implementation of the latter brings two things to mind:
The code employs a safeguard to protect against undefined and NaN values:
"This method ignores undefined and NaN values; this is useful for ignoring missing data."
If you can be sure there are no missing values in your data you can safely get rid of these expensive checks.
While calculating the variance the mean is calculated as a side-effect:
delta = value - mean;
mean += delta / ++m;
sum += delta * (value - mean);
You can use this to return both the variance as well as the mean after a single loop through your data.
Furthermore, d3.mean() also uses the same safeguard against NaN or undefined values as d3.variance(). Calling both methods sequentially does, of course, mean that these checks will also be executed twice for each value.
Borrowing from d3's own implementation a solution to this can be implemented along the following lines:
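function meanAndDeviation(values) {
  const len = values.length;
  let i = 0;
  let value;
  let delta; // declared here so it does not leak as an implicit global
  let mean = 0;
  let sum = 0;
  while (i < len) {
    // update the running mean and the sum of squared deviations in one pass
    delta = (value = values[i]) - mean;
    mean += delta / ++i;
    sum += delta * (value - mean);
  }
  // sample standard deviation (divides by n - 1, like R's sd())
  return { mean, deviation: Math.sqrt(sum / (i - 1)) };
}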
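If missing values cannot be ruled out, the safeguard mentioned above can be kept while still getting the mean and the deviation from one loop. The following is only a minimal sketch under that assumption: the name meanAndDeviationSkippingMissing is made up for illustration, it is not part of d3, and it assumes that every entry that is present is already a number.

```javascript
// Hypothetical variant of meanAndDeviation(): skips undefined, null and NaN
// entries instead of assuming the data is complete.
function meanAndDeviationSkippingMissing(values) {
  let n = 0; // number of usable values seen so far
  let mean = 0;
  let sum = 0;
  for (let i = 0; i < values.length; i++) {
    const value = values[i];
    // value == null catches undefined and null; NaN is the only value
    // that is not equal to itself.
    if (value == null || value !== value) continue;
    const delta = value - mean;
    mean += delta / ++n;
    sum += delta * (value - mean);
  }
  return {
    mean: n > 0 ? mean : undefined,
    // the sample deviation needs at least two usable values
    deviation: n > 1 ? Math.sqrt(sum / (n - 1)) : undefined
  };
}

const withGaps = [23.2, undefined, 13.2, NaN, 18.2, 21.1, 12.5];
console.log(meanAndDeviationSkippingMissing(withGaps));
```

The extra comparison per value is the cost the answer refers to, so this version only makes sense when the data may really contain gaps.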
Have a look at the following demo:
function meanAndDeviation(values) {
  const len = values.length;
  let i = 0;
  let value;
  let delta; // declared here so it does not leak as an implicit global
  let mean = 0;
  let sum = 0;
  while (i < len) {
    delta = (value = values[i]) - mean;
    mean += delta / ++i;
    sum += delta * (value - mean);
  }
  return { mean, deviation: Math.sqrt(sum / (i - 1)) };
}
const arr = [23.2, 13.2, 18.2, 21.1, 12.5];
const {mean, deviation} = meanAndDeviation(arr);
const result = arr.map(d => (d - mean) / deviation);
console.log(result);
Admittedly, destructuring the returned object is not the most performant part of the code, but since it is called only once I like it for its readability.
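To connect this back to the question, the same single-pass update can be kept per year and per key, so the whole dataset is read once to collect the statistics and once more to scale. This is a rough sketch, not a tuned implementation: scaleByYear and its yearKey parameter are names invented for this example, it assumes every row holds a number for each listed key, and it returns copies of the rows rather than changing them in place.

```javascript
// Hypothetical helper: z-scores the given keys within each year group.
// Assumes complete numeric data and at least two rows per year.
function scaleByYear(rows, keys, yearKey = "year") {
  // Pass 1: running mean and sum of squared deviations per year and key.
  const stats = new Map();
  for (const row of rows) {
    let group = stats.get(row[yearKey]);
    if (!group) {
      group = {};
      for (const key of keys) group[key] = { n: 0, mean: 0, sum: 0 };
      stats.set(row[yearKey], group);
    }
    for (const key of keys) {
      const s = group[key];
      const value = row[key];
      const delta = value - s.mean;
      s.mean += delta / ++s.n;
      s.sum += delta * (value - s.mean);
    }
  }
  // Turn the accumulated sums into sample standard deviations (n - 1).
  for (const group of stats.values()) {
    for (const key of keys) {
      const s = group[key];
      s.deviation = Math.sqrt(s.sum / (s.n - 1));
    }
  }
  // Pass 2: copy each row and replace the selected keys with scaled values.
  return rows.map(row => {
    const group = stats.get(row[yearKey]);
    const scaled = { ...row };
    for (const key of keys) {
      scaled[key] = (row[key] - group[key].mean) / group[key].deviation;
    }
    return scaled;
  });
}

// A subset of the question's sample data.
const sample = [
  { name: "Tom", year: 2018, statA: 23.2, statB: 12.3 },
  { name: "Bob", year: 2018, statA: 13.2, statB: 10.1 },
  { name: "Joe", year: 2018, statA: 18.2, statB: 19.3 },
  { name: "Tim", year: 2018, statA: 21.1, statB: 21.3 },
  { name: "Jim", year: 2018, statA: 12.5, statB: 32.4 },
  { name: "Nik", year: 2017, statA: 23.6, statB: 23.8 },
  { name: "Tre", year: 2017, statA: 37.8, statB: 18.3 }
];
const scaledRows = scaleByYear(sample, ["statA", "statB"]);
console.log(scaledRows[0].statA); // approximately 1.1765, as in the R output
```

With ~8K rows and ~40 keys this amounts to a few hundred thousand simple arithmetic updates per pass, and it avoids re-filtering the array once per year.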
Upvotes: 1
Reputation: 102198
As a D3 programmer I'm glad to see the other answer using a D3 scale (especially because the question was not originally tagged with d3.js). However, as the answerer already hinted, you don't need a D3 scale here; it is overkill.
All you need is (value - mean) / deviation:
var result = arr.map(d => (d - mean) / deviation);
Here is the demo:
var arr = [23.2, 13.2, 18.2, 21.1, 12.5];
var deviation = d3.deviation(arr)
var mean = d3.mean(arr)
var result = arr.map(d => (d - mean) / deviation);
console.log(result)
<script src="https://d3js.org/d3.v5.min.js"></script>
Besides that, a consideration about the for loop regarding performance: what you call an ordinary loop is normally the fastest solution.
Upvotes: 3
Reputation: 3458
You can achieve the same result as R's scale() by creating a d3 continuous scale. See the snippet below.
var arr = [23.2, 13.2, 18.2, 21.1, 12.5];
var deviation = d3.deviation(arr)
var mean = d3.mean(arr)
var scale = d3.scaleLinear()
.domain([mean-deviation, mean+deviation])
.range([-1, 1]);
var result = arr.map(el => scale(el));
console.log(result)
<script src="https://d3js.org/d3.v5.min.js"></script>
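Why this matches R's scale(): a linear map from the domain [mean - deviation, mean + deviation] onto the range [-1, 1] reduces algebraically to (value - mean) / deviation. The sketch below checks this without d3. linearMap is a hypothetical stand-in written for this example, not d3's implementation, and the mean and deviation are simply copied from the R output in the question.

```javascript
// Hypothetical stand-in for a linear scale:
// maps x from the domain [d0, d1] onto the range [r0, r1].
function linearMap(x, [d0, d1], [r0, r1]) {
  return r0 + ((x - d0) / (d1 - d0)) * (r1 - r0);
}

// Values taken from the R output in the question.
const center = 17.64;
const spread = 4.72578;
const stats2018 = [23.2, 13.2, 18.2, 21.1, 12.5];

const viaScale = stats2018.map(x =>
  linearMap(x, [center - spread, center + spread], [-1, 1])
);
const viaFormula = stats2018.map(x => (x - center) / spread);

console.log(viaScale);
console.log(viaFormula); // agrees with viaScale up to floating-point rounding
```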
Upvotes: 2