thewal
thewal

Reputation: 75

Computing mean and variance over time per group

I have a data frame with 1,000,000 rows. I would like to calculate mean and variance of Tor overtime for each SID to see if I can predict when Tor is starting to go out of limits. The Low limit is 0.4 and the high limit is 0.7. Below is a small example of my data.

dat <- structure(list(timestamp = c("29-06-2021-06:00", "29-06-2021-06:01", 
"29-06-2021-06:02", "29-06-2021-06:03", "29-06-2021-06:04", "29-06-2021-06:05", 
"29-06-2021-06:06", "29-06-2021-06:07", "29-06-2021-06:08", "29-06-2021-06:09", 
"29-06-2021-06:10", "29-06-2021-06:11", "29-06-2021-06:12", "29-06-2021-06:13", 
"29-06-2021-06:14", "29-06-2021-06:15", "29-06-2021-06:16", "29-06-2021-06:17", 
"29-06-2021-06:18", "29-06-2021-06:19", "29-06-2021-06:20", "29-06-2021-06:21", 
"29-06-2021-06:22", "29-06-2021-06:23", "29-06-2021-06:24", "29-06-2021-06:25", 
"29-06-2021-06:26"), SID = c(301L, 351L, 304L, 357L, 358L, 302L, 
303L, 309L, 356L, 304L, 308L, 351L, 304L, 357L, 358L, 302L, 303L, 
352L, 307L, 353L, 304L, 308L, 352L, 307L, 304L, 354L, 356L), 
    Tor = c(0.70161919, 0.639416295, 0.288282073, 0.932362166, 
    0.368616626, 0.42175565, 0.409735918, 0.942170196, 0.381396521, 
    0.818102394, 0.659391671, 0.246387978, 0.196001777, 0.632630259, 
    0.66618385, 0.440625167, 0.639759498, 0.050001835, 0.775660271, 
    0.762934189, 0.516830196, 0.244674975, 0.38620466, 0.970792903, 
    0.752674581, 0.190366737, 0.56596405), Lowt = c(0L, 0L, 1L, 
    0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 
    0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L), Hit = c(1L, 0L, 0L, 
    1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-27L))
head(dat)
#         timestamp SID       Tor Lowt Hit
#1 29-06-2021-06:00 301 0.7016192    0   1
#2 29-06-2021-06:01 351 0.6394163    0   0
#3 29-06-2021-06:02 304 0.2882821    1   0
#4 29-06-2021-06:03 357 0.9323622    0   1
#5 29-06-2021-06:04 358 0.3686166    1   0
#6 29-06-2021-06:05 302 0.4217556    0   0

I have read up about variance but I can't seem to get my head around it. Any help would be great.

Upvotes: 1

Views: 125

Answers (1)

Zheyuan Li
Zheyuan Li

Reputation: 73385

This is a very good question. You want to compute cumulative mean and cumulative variance of Tor (over time) per SID. Given the volume of your actual dataset, it is appropriate to use online algorithms. See my answer and Benjamin's answer on this topic back in 2018 for algorithmic details. In brief, my contribution is:

cummean <- function (x) cumsum(x) / seq_along(x)

cumvar <- function (x, sd = FALSE) {
  x <- x - x[sample.int(length(x), 1)]
  n <- seq_along(x)
  v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
  if (sd) v <- sqrt(v)
  v
}

The extra work required here, is to apply these functions per SID.

## sort data entries
sorted_dat <- dat[order(dat$SID, dat$timestamp), ]

## split Tor by SID
lst <- split(sorted_dat$Tor, sorted_dat$SID)

## apply cummean() and cumvar()
runmean <- unlist(lapply(lst, cummean), use.names = FALSE)
runvar <- unlist(lapply(lst, cumvar), use.names = FALSE)

## add back
sorted_dat$runmean <- runmean
sorted_dat$runvar <- runvar

Here is the result. Don't be surprised by the NaN in variance. The first value is always NaN in each SID. This is just normal (we can only compute variance when there are 2+ data).

## inspection
sorted_dat
#          timestamp SID        Tor Lowt Hit    runmean       runvar
#1  29-06-2021-06:00 301 0.70161919    0   1 0.70161919          NaN
#6  29-06-2021-06:05 302 0.42175565    0   0 0.42175565          NaN
#16 29-06-2021-06:15 302 0.44062517    0   0 0.43119041 0.0001780293
#7  29-06-2021-06:06 303 0.40973592    1   0 0.40973592          NaN
#17 29-06-2021-06:16 303 0.63975950    0   0 0.52474771 0.0264554237
#3  29-06-2021-06:02 304 0.28828207    1   0 0.28828207          NaN
#10 29-06-2021-06:09 304 0.81810239    0   1 0.55319223 0.1403547863
#13 29-06-2021-06:12 304 0.19600178    1   0 0.43412875 0.1127057339
#21 29-06-2021-06:20 304 0.51683020    0   0 0.45480411 0.0768470383
#25 29-06-2021-06:24 304 0.75267458    0   1 0.51437820 0.0753806422
#19 29-06-2021-06:18 307 0.77566027    0   1 0.77566027          NaN
#24 29-06-2021-06:23 307 0.97079290    0   1 0.87322659 0.0190383720
#11 29-06-2021-06:10 308 0.65939167    0   0 0.65939167          NaN
#22 29-06-2021-06:21 308 0.24467497    1   0 0.45203332 0.0859949690
#8  29-06-2021-06:07 309 0.94217020    0   1 0.94217020          NaN
#2  29-06-2021-06:01 351 0.63941629    0   0 0.63941629          NaN
#12 29-06-2021-06:11 351 0.24638798    1   0 0.44290214 0.0772356290
#18 29-06-2021-06:17 352 0.05000184    1   0 0.05000184          NaN
#23 29-06-2021-06:22 352 0.38620466    1   0 0.21810325 0.0565161698
#20 29-06-2021-06:19 353 0.76293419    0   1 0.76293419          NaN
#26 29-06-2021-06:25 354 0.19036674    1   0 0.19036674          NaN
#9  29-06-2021-06:08 356 0.38139652    1   0 0.38139652          NaN
#27 29-06-2021-06:26 356 0.56596405    0   0 0.47368029 0.0170325864
#4  29-06-2021-06:03 357 0.93236217    0   1 0.93236217          NaN
#14 29-06-2021-06:13 357 0.63263026    0   0 0.78249621 0.0449196080
#5  29-06-2021-06:04 358 0.36861663    1   0 0.36861663          NaN
#15 29-06-2021-06:14 358 0.66618385    0   0 0.51740024 0.0442731264

Upvotes: 2

Related Questions