Reputation: 2077
I'm working with a set of data that looks like the following:
team runs_scored date
LAN 3 2014-03-22
ARI 1 2014-03-22
LAN 7 2014-03-23
ARI 5 2014-03-23
LAN 1 2014-03-30
SDN 3 2014-03-30
I'm trying to test a predictive model on this set and one of the input parameters is the variance of runs_scored
in t-1. In other words, to predict the outcome variable for the 4th observation, I need the variance of LAN
based on the prior observations in the dataset.
I can compute cumulative means and sums, but I can't quite figure out how to compute the cumulative variance in the data set. I'm doing most of my data manipulation in dplyr, but I'm not opposed to an alternative solution if it gets me what I need.
Upvotes: 2
Views: 1545
Reputation: 7373
Writing the variance formula as (sum(x^2) - length(x)*mean(x)^2)/(length(x) - 1), you can see that it generalizes easily to cumulative variances: just replace each function in it with its cumulative version (cummean is from dplyr). So,
library(dplyr)
cum_var <- function(x) {
  # cumulative sample variance via cumulative sums:
  # (sum of squares - n * mean^2) / (n - 1); the first element is NaN
  n <- seq_along(x)
  (cumsum(x^2) - n * cummean(x)^2) / (n - 1)
}
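As a quick check on the question's runs_scored values, it agrees with computing var() over each prefix, except for the first element, where the one-pass formula returns NaN rather than NA:
x <- c(3, 1, 7, 5, 1, 3)   # runs_scored from the question
cum_var(x)
# [1]      NaN 2.000000 9.333333 6.666667 6.800000 5.466667
sapply(seq_along(x), function(i) var(x[1:i]))
# [1]       NA 2.000000 9.333333 6.666667 6.800000 5.466667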
And a speed comparison to @MrFlick's cumvar seems encouraging:
x <- rnorm(1e6)
all.equal(cum_var(x), cumvar(x))
#[1] TRUE
system.time(cumvar(x))[3]
# elapsed
#    5.52
system.time(cum_var(x))[3]
# elapsed
#    0.04
Upvotes: 11
Reputation: 206486
If you want a cumulative variance, you could implement the online algorithm for variance. The main benefit is that it scales linearly rather than quadratically, as it would if you recomputed the variance from scratch for every prefix of the data.
If you have
x<-c(3,1,7,5,1,3)
You can do
cumvar <- function(x) {
  # Online (Welford-style) update: nn counts observations, mm tracks the running
  # mean, and the accumulated value is the running sum of squared deviations.
  tail(Reduce(local({mm <- 0; nn <- 0; function(a, b) {
    nn <<- nn + 1; d <- b - mm; mm <<- mm + d / nn; a + d * (b - mm)
  }}), x, 0, accumulate = TRUE), -1) / (seq_along(x) - 1)
}
cumvar(x)
# [1]      NaN 2.000000 9.333333 6.666667 6.800000 5.466667
Which returns the same result as
cumvar2 <- function(x) {
  sapply(seq_along(x), function(i) var(x[1:i]))
}
cumvar2(x)
# [1]       NA 2.000000 9.333333 6.666667 6.800000 5.466667
And we can compare efficiencies with
set.seed(15)
x<-rpois(100, 5)
microbenchmark::microbenchmark(cumvar(x), cumvar2(x))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# cumvar(x) 272.502 297.2425 335.2058 315.490 339.625 957.728 100 a
# cumvar2(x) 1672.323 1793.0960 2089.8104 1865.838 1956.208 6386.863 100 b
But if you want to use this algorithm, I suggest you read the wiki page; if you're only calculating the variance once, the two-pass method is more robust.
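As a minimal illustration of what "two-pass" means here (the name two_pass_var is just for this sketch): compute the mean in a first pass, then sum the squared deviations in a second, which is essentially what var() does.
two_pass_var <- function(x) {
  m <- mean(x)                        # first pass: the mean
  sum((x - m)^2) / (length(x) - 1)    # second pass: squared deviations
}
two_pass_var(c(3, 1, 7, 5, 1, 3))
# [1] 5.466667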
You could use it with dplyr like this:
dd<-read.table(text="team runs_scored date
LAN 3 2014-03-22
ARI 1 2014-03-22
LAN 7 2014-03-23
ARI 5 2014-03-23
LAN 1 2014-03-30
SDN 3 2014-03-30", header=T)
dd %>% mutate(cvar=lag(cumvar(runs_scored)))
# team runs_scored date cvar
# 1 LAN 3 2014-03-22 NA
# 2 ARI 1 2014-03-22 NaN
# 3 LAN 7 2014-03-23 2.000000
# 4 ARI 5 2014-03-23 9.333333
# 5 LAN 1 2014-03-30 6.666667
# 6 SDN 3 2014-03-30 6.800000
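If you instead want the prior variance within each team (the question's wording could be read either way), a grouped sketch of the same idea would be:
dd %>%
  group_by(team) %>%
  mutate(cvar = lag(cumvar(runs_scored))) %>%
  ungroup()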
Upvotes: 5
Reputation: 880
Do you have a large dataset? If for loops aren't too slow, you can just do this:
data$vars <- NA
for (i in 2:nrow(data)) {
  # variance of runs_scored over all rows before row i
  data$vars[i] <- var(data$runs_scored[1:(i - 1)])
}
This gives:
team runs_scored date vars
1 LAN 3 3/22/2014 NA
2 ARI 1 3/22/2014 NA
3 LAN 7 3/23/2014 2.000000
4 ARI 5 3/23/2014 9.333333
5 LAN 1 3/30/2014 6.666667
6 SDN 3 3/30/2014 6.800000
Edit: if you want to do it slightly faster, you can write a function specific to this application:
data$vars <- NA
cumVar <- function(position, df) {
  # variance of runs_scored over all rows before `position`
  var(df$runs_scored[1:(position - 1)])
}
Then use sapply to apply the function and get a vector out:
position <- 3:nrow(data)
results <- c(NA, NA, sapply(position, cumVar, df = data))
data$vars <- results
On my machine, with about 30,000 rows, the for loop took about 10.5 seconds and the sapply version about 7.5 seconds.
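A rough way to reproduce that comparison on simulated data (the rpois data frame below is made up for illustration, and timings will vary by machine):
set.seed(1)
data <- data.frame(runs_scored = rpois(30000, 5))
# for loop over growing prefixes
system.time({
  data$vars <- NA
  for (i in 2:nrow(data)) data$vars[i] <- var(data$runs_scored[1:(i - 1)])
})
# sapply version using cumVar from above
system.time(
  data$vars <- c(NA, NA, sapply(3:nrow(data), cumVar, df = data))
)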
Upvotes: 2