Reputation: 75
I have a data frame with 1,000,000 rows. I would like to calculate mean and variance of Tor
overtime for each SID
to see if I can predict when Tor
is starting to go out of limits. The Low limit is 0.4 and the high limit is 0.7. Below is a small example of my data.
dat <- structure(list(timestamp = c("29-06-2021-06:00", "29-06-2021-06:01",
"29-06-2021-06:02", "29-06-2021-06:03", "29-06-2021-06:04", "29-06-2021-06:05",
"29-06-2021-06:06", "29-06-2021-06:07", "29-06-2021-06:08", "29-06-2021-06:09",
"29-06-2021-06:10", "29-06-2021-06:11", "29-06-2021-06:12", "29-06-2021-06:13",
"29-06-2021-06:14", "29-06-2021-06:15", "29-06-2021-06:16", "29-06-2021-06:17",
"29-06-2021-06:18", "29-06-2021-06:19", "29-06-2021-06:20", "29-06-2021-06:21",
"29-06-2021-06:22", "29-06-2021-06:23", "29-06-2021-06:24", "29-06-2021-06:25",
"29-06-2021-06:26"), SID = c(301L, 351L, 304L, 357L, 358L, 302L,
303L, 309L, 356L, 304L, 308L, 351L, 304L, 357L, 358L, 302L, 303L,
352L, 307L, 353L, 304L, 308L, 352L, 307L, 304L, 354L, 356L),
Tor = c(0.70161919, 0.639416295, 0.288282073, 0.932362166,
0.368616626, 0.42175565, 0.409735918, 0.942170196, 0.381396521,
0.818102394, 0.659391671, 0.246387978, 0.196001777, 0.632630259,
0.66618385, 0.440625167, 0.639759498, 0.050001835, 0.775660271,
0.762934189, 0.516830196, 0.244674975, 0.38620466, 0.970792903,
0.752674581, 0.190366737, 0.56596405), Lowt = c(0L, 0L, 1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L), Hit = c(1L, 0L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-27L))
head(dat)
# timestamp SID Tor Lowt Hit
#1 29-06-2021-06:00 301 0.7016192 0 1
#2 29-06-2021-06:01 351 0.6394163 0 0
#3 29-06-2021-06:02 304 0.2882821 1 0
#4 29-06-2021-06:03 357 0.9323622 0 1
#5 29-06-2021-06:04 358 0.3686166 1 0
#6 29-06-2021-06:05 302 0.4217556 0 0
Timestamp
is when sample is recorded
SID
is the ID of the part taking the reading. These values can be 301 - 310 and 351 to 360
Tor
is the actual reading, and its data type is <dbl>
.
Lowt
is a binary variable showing that the Tor
reading is below the lower limit.
Hit
is a binary variable showing that the Tor
reading is below the upper limit.
I have read up about variance but I can't seem to get my head around it. Any help would be great.
Upvotes: 1
Views: 125
Reputation: 73385
This is a very good question. You want to compute cumulative mean and cumulative variance of Tor
(over time) per SID
. Given the volume of your actual dataset, it is appropriate to use online algorithms. See my answer and Benjamin's answer on this topic back in 2018 for algorithmic details. In brief, my contribution is:
cummean <- function (x) cumsum(x) / seq_along(x)
cumvar <- function (x, sd = FALSE) {
x <- x - x[sample.int(length(x), 1)]
n <- seq_along(x)
v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
if (sd) v <- sqrt(v)
v
}
The extra work required here, is to apply these functions per SID.
## sort data entries
sorted_dat <- dat[order(dat$SID, dat$timestamp), ]
## split Tor by SID
lst <- split(sorted_dat$Tor, sorted_dat$SID)
## apply cummean() and cumvar()
runmean <- unlist(lapply(lst, cummean), use.names = FALSE)
runvar <- unlist(lapply(lst, cumvar), use.names = FALSE)
## add back
sorted_dat$runmean <- runmean
sorted_dat$runvar <- runvar
Here is the result. Don't be surprised by the NaN in variance. The first value is always NaN in each SID. This is just normal (we can only compute variance when there are 2+ data).
## inspection
sorted_dat
# timestamp SID Tor Lowt Hit runmean runvar
#1 29-06-2021-06:00 301 0.70161919 0 1 0.70161919 NaN
#6 29-06-2021-06:05 302 0.42175565 0 0 0.42175565 NaN
#16 29-06-2021-06:15 302 0.44062517 0 0 0.43119041 0.0001780293
#7 29-06-2021-06:06 303 0.40973592 1 0 0.40973592 NaN
#17 29-06-2021-06:16 303 0.63975950 0 0 0.52474771 0.0264554237
#3 29-06-2021-06:02 304 0.28828207 1 0 0.28828207 NaN
#10 29-06-2021-06:09 304 0.81810239 0 1 0.55319223 0.1403547863
#13 29-06-2021-06:12 304 0.19600178 1 0 0.43412875 0.1127057339
#21 29-06-2021-06:20 304 0.51683020 0 0 0.45480411 0.0768470383
#25 29-06-2021-06:24 304 0.75267458 0 1 0.51437820 0.0753806422
#19 29-06-2021-06:18 307 0.77566027 0 1 0.77566027 NaN
#24 29-06-2021-06:23 307 0.97079290 0 1 0.87322659 0.0190383720
#11 29-06-2021-06:10 308 0.65939167 0 0 0.65939167 NaN
#22 29-06-2021-06:21 308 0.24467497 1 0 0.45203332 0.0859949690
#8 29-06-2021-06:07 309 0.94217020 0 1 0.94217020 NaN
#2 29-06-2021-06:01 351 0.63941629 0 0 0.63941629 NaN
#12 29-06-2021-06:11 351 0.24638798 1 0 0.44290214 0.0772356290
#18 29-06-2021-06:17 352 0.05000184 1 0 0.05000184 NaN
#23 29-06-2021-06:22 352 0.38620466 1 0 0.21810325 0.0565161698
#20 29-06-2021-06:19 353 0.76293419 0 1 0.76293419 NaN
#26 29-06-2021-06:25 354 0.19036674 1 0 0.19036674 NaN
#9 29-06-2021-06:08 356 0.38139652 1 0 0.38139652 NaN
#27 29-06-2021-06:26 356 0.56596405 0 0 0.47368029 0.0170325864
#4 29-06-2021-06:03 357 0.93236217 0 1 0.93236217 NaN
#14 29-06-2021-06:13 357 0.63263026 0 0 0.78249621 0.0449196080
#5 29-06-2021-06:04 358 0.36861663 1 0 0.36861663 NaN
#15 29-06-2021-06:14 358 0.66618385 0 0 0.51740024 0.0442731264
Upvotes: 2