Ayoor J. Daves

Reputation: 39

Difference in Variance computation

I have manually computed the variance of two data sets using the definitional formula, the computational (one-pass) formula, and the built-in R function.

library(tibble)

set.seed(12345)
n <- 1e7
df <- tibble(
  small = rnorm(n, mean = 100, sd = 1),
  large = rnorm(n, mean = 1e8, sd = 1)
)

# Definitional (two-pass) formula: sum of squared deviations from the mean
varFuncd <- function(x) {
  x <- as.numeric(as.character(x))
  x <- x[!is.na(x)]
  sum((x - mean(x))^2) / (length(x) - 1)
}

# Computational (one-pass) formula: difference of sums of squares
varFuncc <- function(x) {
  x <- as.numeric(as.character(x))
  x <- x[!is.na(x)]
  (sum(x^2) - sum(x)^2 / length(x)) / (length(x) - 1)
}
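For example, one way to apply both functions (and base R's var() for comparison) to the two columns is with sapply():

# Apply each estimator to both columns. varFuncd() and var() give ~1 for
# both columns, while varFuncc() drifts noticeably away from 1 on `large`.
sapply(df, varFuncd)
sapply(df, varFuncc)
sapply(df, var)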

The computational formula, however, gives an unexpectedly large variance (about 1.6) for the large column, while the definitional formula gives about 1. What might be the reason?

My response is:

All the definitional expressions produced the expected variance of about 1, but the computational expression for "large" produced a larger value. The definitional formula squares the deviations from the mean, which are small numbers, so the result stays accurate. The computational formula instead takes a difference of sums of squares: when the underlying values are large, squaring them produces enormous numbers, and subtracting two nearly equal enormous numbers loses most of the significant digits before the division by n - 1.
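A rough magnitude check (using the n and mean from the code above) makes the loss concrete:

# Each value in `large` is about 1e8, so x^2 is about 1e16 and the sum
# over 1e7 values is about 1e23. A double carries only ~15-16 significant
# digits, so near 1e23 the absolute rounding error is comparable to the
# true sum of squared deviations (~1e7) that the subtraction must recover.
.Machine$double.eps * 1e23   # ~2e7: rounding granularity near 1e23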

Upvotes: 0

Views: 93

Answers (1)

Paul Raff

Reputation: 93

I agree that you're running into numerical stability problems, since R stores numeric values as double-precision floating-point numbers. From Wikipedia, discussing the specific formula for variance that you are using in varFuncc:

This equation should not be used for computations using floating point arithmetic because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude. There exist numerically stable alternatives.
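As one example of a stable alternative (a sketch of my own, not taken from your code): variance is unaffected by shifting the data, so you can subtract a constant close to the mean before applying the same one-pass formula, which keeps the intermediate sums small.

# Shifted one-pass variance: subtract a constant k near the mean so the
# squared terms stay small and the final subtraction no longer cancels.
varFuncShifted <- function(x) {
  x <- as.numeric(x)
  x <- x[!is.na(x)]
  k <- x[1]                               # any value near the mean will do
  s1 <- sum(x - k)
  s2 <- sum((x - k)^2)
  (s2 - s1^2 / length(x)) / (length(x) - 1)
}

# varFuncShifted(df$large) should now agree with var(df$large), i.e. about 1.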

Upvotes: 1
