Bradley

Reputation: 2077

Compute Variance per Period in R

I'm working with a set of data that looks like the following:

team runs_scored       date
LAN           3        2014-03-22
ARI           1        2014-03-22
LAN           7        2014-03-23
ARI           5        2014-03-23
LAN           1        2014-03-30
SDN           3        2014-03-30

I'm trying to test a predictive model on this set and one of the input parameters is the variance of runs_scored in t-1. In other words, to predict the outcome variable for the 4th observation, I need the variance of LAN based on the prior observations in the dataset.

I can compute cumulative means and sums, but I can't quite figure out how to compute the cumulative variance in the data set. I'm doing most of my data manipulation in dplyr, but I'm not opposed to using an alternative solution if it gets me what I need.

Upvotes: 2

Views: 1545

Answers (3)

Khashaa

Reputation: 7373

Writing out the variance formula as (sum(x^2)-length(x)*mean(x)^2)/(length(x)-1), you can see that it generalizes easily to cumulative variances: just replace each function in it with its cumulative version (cummean is from dplyr). So,

library(dplyr)
cum_var <- function(x){
    n <- 1:length(x)
    (cumsum(x^2)-n*cummean(x)^2)/(n-1)
}
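
As a quick sanity check, the same formula can be sketched in base R and compared against var() on a prefix of the data. The name cum_var_base and the sample vector are only illustrative assumptions, not part of the original answer.

```r
# Base-R sketch of the cumulative variance formula.
# cum_var_base is a hypothetical name used only for this illustration.
cum_var_base <- function(x) {
  n <- seq_along(x)
  running_mean <- cumsum(x) / n   # same quantity dplyr::cummean() returns
  (cumsum(x^2) - n * running_mean^2) / (n - 1)
}

x <- c(3, 1, 7, 5, 1, 3)
cum_var_base(x)[4]   # cumulative variance after four values
var(x[1:4])          # direct computation on the same prefix
```

Both calls should agree up to floating-point rounding; the first element of the result is NaN because the formula divides 0 by 0 when n is 1.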

And speed comparison to @MrFlick's cumvar seems encouraging.

x <- rnorm(1e6)
all.equal(cum_var(x), cumvar(x))
#[1] TRUE
system.time(cumvar(x))[3]
#elapsed 
#   5.52 
system.time(cum_var(x))[3]
#elapsed 
#   0.04 

Upvotes: 11

MrFlick

Reputation: 206486

If you want a cumulative variance, you could implement the online algorithm for variance. The main benefit is that it scales linearly, rather than quadratically as it would if you recomputed the variance of every prefix from scratch.

If you have

x<-c(3,1,7,5,1,3)

You can do

cumvar<-function(x) {
   tail(Reduce(local({mm<-0; nn<-0; function(a,b) 
        {nn<<-nn+1; d<-b-mm; mm<<-mm+d/nn; a+d*(b-mm)}}), 
        x, 0, accumulate=TRUE), -1)/(seq_along(x)-1)
}
cumvar(x)
# [1]      NaN 2.000000 9.333333 6.666667 6.800000 5.466667

Which returns the same result (apart from NaN rather than NA in the first position) as

cumvar2 <- function(x)  {
    sapply(seq_along(x), function(i) var(x[1:i]))
}
cumvar2(x)
# [1]       NA 2.000000 9.333333 6.666667 6.800000 5.466667
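
If the Reduce closure is hard to follow, the same online update can be sketched as an explicit loop. This is only an illustrative rewrite under the assumption that x is a plain numeric vector; cumvar_loop is a hypothetical name, not something from a package.

```r
# Explicit-loop sketch of the online (Welford-style) variance update.
# cumvar_loop is a hypothetical name used only for this illustration.
cumvar_loop <- function(x) {
  n <- length(x)
  out <- numeric(n)
  m <- 0    # running mean
  m2 <- 0   # running sum of squared deviations from the mean
  for (i in seq_len(n)) {
    d <- x[i] - m
    m <- m + d / i
    m2 <- m2 + d * (x[i] - m)
    out[i] <- m2 / (i - 1)   # 0/0 gives NaN for the first element
  }
  out
}

x <- c(3, 1, 7, 5, 1, 3)
prefix_var <- sapply(seq_along(x), function(i) var(x[1:i]))
all.equal(cumvar_loop(x)[-1], prefix_var[-1])
# [1] TRUE
```

The first element is dropped from the comparison because the loop yields NaN there while var() of a single value yields NA.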

And we can compare efficiencies with

set.seed(15)
x<-rpois(100, 5)
microbenchmark::microbenchmark(cumvar(x), cumvar2(x))

# Unit: microseconds
#        expr      min        lq      mean   median       uq      max neval cld
#   cumvar(x)  272.502  297.2425  335.2058  315.490  339.625  957.728   100  a 
#  cumvar2(x) 1672.323 1793.0960 2089.8104 1865.838 1956.208 6386.863   100   b

But if you want to use this algorithm, I suggest you read the wiki page first. If you're only calculating the variance once, the two-pass method is more robust.
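
To see why robustness matters, here is a small sketch comparing the one-pass sum-of-squares shortcut with a two-pass computation. The numbers are made up for illustration: a large offset with a small spread, chosen so that the exact sample variance is 30.

```r
# Sketch: large offset, small spread. The exact sample variance is 30.
x <- c(4, 7, 13, 16) + 1e9

# One-pass shortcut: subtracts two numbers near 4e18, so rounding error
# in the squares can swamp the true result.
shortcut <- (sum(x^2) - length(x) * mean(x)^2) / (length(x) - 1)

# Two-pass: center the data first, then sum the squared deviations.
two_pass <- sum((x - mean(x))^2) / (length(x) - 1)

c(shortcut = shortcut, two_pass = two_pass, builtin = var(x))
```

With double precision the shortcut value should be visibly off from 30, while the two-pass result and var() should agree. How far off the shortcut lands may depend on the platform, so no specific value is shown here.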

You could use it with dplyr with

library(dplyr)
dd<-read.table(text="team runs_scored       date
LAN           3        2014-03-22
ARI           1        2014-03-22
LAN           7        2014-03-23
ARI           5        2014-03-23
LAN           1        2014-03-30
SDN           3        2014-03-30", header=T)

dd %>% mutate(cvar=lag(cumvar(runs_scored)))

#   team runs_scored       date     cvar
# 1  LAN           3 2014-03-22       NA
# 2  ARI           1 2014-03-22      NaN
# 3  LAN           7 2014-03-23 2.000000
# 4  ARI           5 2014-03-23 9.333333
# 5  LAN           1 2014-03-30 6.666667
# 6  SDN           3 2014-03-30 6.800000
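
The result above pools all teams together. Since the question asks for the variance of each team's own prior games, a per-team version may be closer to what is needed. The following is a minimal base-R sketch under that assumption; prior_var is a hypothetical helper name, and it uses the simple var()-per-prefix approach rather than the online algorithm to keep the example short.

```r
# Sketch: variance of each team's earlier games only (base R, no packages).
dd <- read.table(text = "team runs_scored date
LAN 3 2014-03-22
ARI 1 2014-03-22
LAN 7 2014-03-23
ARI 5 2014-03-23
LAN 1 2014-03-30
SDN 3 2014-03-30", header = TRUE)

# prior_var is a hypothetical helper: cumulative variance shifted by one
# position, so row i only sees observations before row i.
prior_var <- function(v) {
  cv <- sapply(seq_along(v), function(i) var(v[1:i]))
  c(NA, head(cv, -1))
}

# ave() applies the helper within each team and keeps the row order.
dd$prior_var <- ave(dd$runs_scored, dd$team, FUN = prior_var)
dd$prior_var
# [1] NA NA NA NA  8 NA
```

Only the fifth row gets a value, because LAN is the only team with two earlier games (3 and 7). With dplyr, the same idea would be dd %>% group_by(team) %>% mutate(cvar=lag(cumvar(runs_scored))).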

Upvotes: 5

goodtimeslim

Reputation: 880

Do you have a large dataset? If for loops aren't too slow, you can just do this:

data$vars <- NA
for(i in 2:nrow(data)){
  data$vars[i] <- var(data$runs_scored[1:(i - 1)])
}

This gives:

  team runs_scored      date     vars
1  LAN           3 3/22/2014       NA
2  ARI           1 3/22/2014       NA
3  LAN           7 3/23/2014 2.000000
4  ARI           5 3/23/2014 9.333333
5  LAN           1 3/30/2014 6.666667
6  SDN           3 3/30/2014 6.800000

edit: if you want to do it slightly faster, you can write a specific function for this application:

# use the df argument instead of reaching for the global `data`
cumVar <- function(position, df){
  return(var(df$runs_scored[1:(position - 1)]))
}

Then use sapply to apply the function and get a vector out:

position <- 3:nrow(data)
results <- c(NA, NA, sapply(position, cumVar, data))
data$vars <- results

On my machine, with about 30000 rows, the for loop took about 10.5 seconds and the sapply version about 7.5 seconds.

Upvotes: 2
