Reputation: 23
I have a millions row long data.table
with about 20 counter-style columns. Those columns display various storage system operations' counters increasing in time. Sometimes however, the counter gets reset on the monitored system and the single observation value is lower than a previous one.
I need to calculate a opsdiff
column, which would contain an arithmetical difference (expected to be positive) of subsequent values of same type based on column type
. When a counter reset situation is identified - ie. the difference is negative, the actual value of the counter should be used.
> dt <- data.table(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250))
> dt
type opscounter
1: a 105609
2: a 106316
3: a 106705
4: a 489
5: a 723
6: a 1250
7: b 105609
8: b 106316
9: b 106705
10: b 489
11: b 723
12: b 1250
The result I wish to get:
> dt
type opscounter opsdiff
1: a 105609 NA
2: a 106316 707
3: a 106705 389
4: a 489 489
5: a 723 234
6: a 1250 527
7: b 105609 NA
8: b 106316 707
9: b 106705 389
10: b 489 489
11: b 723 234
12: b 1250 527
>
I tried to build a construction based on ifelse()
but I don't know how to address the current row's opscounter
value in else part and neither the dual call of diff()
is perhaps an efficient usage:
dt[,opsdiff:=c(NA, ifelse(diff(opscounter)>0, diff(opscounter), opscounter)), by=type]
How can I correctly calculate the opsdiff
column?
Upvotes: 2
Views: 1086
Reputation: 92282
The preferred solution would be:
dt[, opsdiff := c(NA, diff(opscounter)),
by = type][opsdiff < 0, opsdiff := opscounter][]
# type opscounter opsdiff
# 1: a 105609 NA
# 2: a 106316 707
# 3: a 106705 389
# 4: a 489 489
# 5: a 723 234
# 6: a 1250 527
# 7: b 105609 NA
# 8: b 106316 707
# 9: b 106705 389
# 10: b 489 489
# 11: b 723 234
# 12: b 1250 527
Note that I've added additional
[]
in order to print results on the fly and in order to illustrate that you can add more than one of these.
In general, it would be better to avoid ifelse
(especially in your case with such big data set) as it can to be slow (although vectorized) due to it evaluating both yes and no cases. In your case, you found another "flaw", that you need to tell it the exact locations where you want to pull opscounter
from, which adds to the complexity (see @Aruns comment for possible override).
Regarding your question in comments, a data.table
operation of the form DT[...]
just calls the function [.data.table(DT, ...)
. It's not any different on a data.frame
; there is a similar function [.data.frame
.
Note that a
data.table
is also adata.frame
. Seeclass(dt)
and also read?data.table
.
To make it even clearer, in data.table, tacking [...]
one after another is called chaining. This is something that comes for free. You can do the same in a data.frame as well (as shown below), but the operations you can perform on a data.frame
are limited, and therefore the use of chaining itself, unlike data.table.
df <- as.data.frame(dt) # or `setDF(dt)` in 1.9.4+ to do this by reference
df[df$type == "a", ][2:3, ]
# type opscounter
# 2 a 106316
# 3 a 106705
Finally, in order to illustrate the ineffectiveness of ifelse
, here is a benchmark:
set.seed(123)
n <- 1e6
dt <- data.table(type = rep(c("a","b"), each = n),
opscounter = sample(1:1e5, n*2, replace = TRUE))
library(microbenchmark)
microbenchmark(
dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter],
dt[, opsdiff := c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L))), by=type]
)
# Unit: milliseconds
# expr
# dt[, `:=`(opsdiff, c(NA, diff(opscounter))), by = type][opsdiff < 0, `:=`(opsdiff, opscounter)]
# dt[, `:=`(opsdiff, c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L)))), by = type]
# min lq mean median uq max neval
# 228.0445 255.4006 285.8163 281.1388 307.4195 508.3841 100
# 899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704 100
The ifelse
solution is slower by a factor of ~4.
Upvotes: 2
Reputation: 23574
Since this is not an approach with data.table, this would not be the ideal one. But, the following would be an alternative approach.
library(dplyr)
df <- data.frame(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250))
df %>%
group_by(type) %>%
mutate(opsdiff = opscounter - lag(opscounter)) %>%
mutate(opsdiff = ifelse(opsdiff < 0, opscounter, opsdiff))
# type opscounter opsdiff
#1 a 105609 NA
#2 a 106316 707
#3 a 106705 389
#4 a 489 489
#5 a 723 234
#6 a 1250 527
#7 b 105609 NA
#8 b 106316 707
#9 b 106705 389
#10 b 489 489
#11 b 723 234
#12 b 1250 527
Upvotes: 0