Conditional difference calculation in data table

I have a millions row long data.table with about 20 counter-style columns. Those columns display various storage system operations' counters increasing in time. Sometimes however, the counter gets reset on the monitored system and the single observation value is lower than a previous one.

I need to calculate a opsdiff column, which would contain an arithmetical difference (expected to be positive) of subsequent values of same type based on column type. When a counter reset situation is identified - ie. the difference is negative, the actual value of the counter should be used.

> dt <- data.table(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250))
> dt
    type opscounter
 1:    a     105609
 2:    a     106316
 3:    a     106705
 4:    a        489
 5:    a        723
 6:    a       1250
 7:    b     105609
 8:    b     106316
 9:    b     106705
10:    b        489
11:    b        723
12:    b       1250

The result I wish to get:

> dt
    type opscounter     opsdiff
 1:    a     105609          NA
 2:    a     106316         707
 3:    a     106705         389
 4:    a        489         489
 5:    a        723         234
 6:    a       1250         527
 7:    b     105609          NA
 8:    b     106316         707
 9:    b     106705         389
10:    b        489         489
11:    b        723         234
12:    b       1250         527
> 

I tried to build a construction based on ifelse() but I don't know how to address the current row's opscounter value in else part and neither the dual call of diff() is perhaps an efficient usage:

dt[,opsdiff:=c(NA, ifelse(diff(opscounter)>0, diff(opscounter), opscounter)), by=type]

How can I correctly calculate the opsdiff column?

Upvotes: 2

Views: 1086

Answers (2)

David Arenburg
David Arenburg

Reputation: 92282

The preferred solution would be:

dt[, opsdiff := c(NA, diff(opscounter)), 
   by = type][opsdiff < 0, opsdiff := opscounter][]
#     type opscounter opsdiff
#  1:    a     105609      NA
#  2:    a     106316     707
#  3:    a     106705     389
#  4:    a        489     489
#  5:    a        723     234
#  6:    a       1250     527
#  7:    b     105609      NA
#  8:    b     106316     707
#  9:    b     106705     389
# 10:    b        489     489
# 11:    b        723     234
# 12:    b       1250     527

Note that I've added additional [] in order to print results on the fly and in order to illustrate that you can add more than one of these.

In general, it would be better to avoid ifelse (especially in your case with such big data set) as it can to be slow (although vectorized) due to it evaluating both yes and no cases. In your case, you found another "flaw", that you need to tell it the exact locations where you want to pull opscounter from, which adds to the complexity (see @Aruns comment for possible override).

Regarding your question in comments, a data.table operation of the form DT[...] just calls the function [.data.table(DT, ...). It's not any different on a data.frame; there is a similar function [.data.frame.

Note that a data.table is also a data.frame. See class(dt) and also read ?data.table.

To make it even clearer, in data.table, tacking [...] one after another is called chaining. This is something that comes for free. You can do the same in a data.frame as well (as shown below), but the operations you can perform on a data.frame are limited, and therefore the use of chaining itself, unlike data.table.

df <- as.data.frame(dt) # or `setDF(dt)` in 1.9.4+ to do this by reference
df[df$type == "a", ][2:3, ]
#   type opscounter
# 2    a     106316
# 3    a     106705

Finally, in order to illustrate the ineffectiveness of ifelse, here is a benchmark:

set.seed(123)
n <- 1e6
dt <- data.table(type = rep(c("a","b"), each = n), 
                 opscounter = sample(1:1e5, n*2, replace = TRUE))


library(microbenchmark)
microbenchmark(
dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter],
dt[, opsdiff := c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L))), by=type]
)

# Unit: milliseconds
#                                                                                                         expr
#             dt[, `:=`(opsdiff, c(NA, diff(opscounter))), by = type][opsdiff <  0, `:=`(opsdiff, opscounter)]
# dt[, `:=`(opsdiff, c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L)))), by = type]
#      min       lq      mean    median        uq       max neval
# 228.0445 255.4006  285.8163  281.1388  307.4195  508.3841   100
# 899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704   100

The ifelse solution is slower by a factor of ~4.

Upvotes: 2

jazzurro
jazzurro

Reputation: 23574

Since this is not an approach with data.table, this would not be the ideal one. But, the following would be an alternative approach.

library(dplyr)
df <- data.frame(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250))

df %>%
    group_by(type) %>%
    mutate(opsdiff = opscounter - lag(opscounter)) %>%
    mutate(opsdiff = ifelse(opsdiff < 0, opscounter, opsdiff))

#   type opscounter opsdiff
#1     a     105609      NA
#2     a     106316     707
#3     a     106705     389
#4     a        489     489
#5     a        723     234
#6     a       1250     527
#7     b     105609      NA
#8     b     106316     707
#9     b     106705     389
#10    b        489     489
#11    b        723     234
#12    b       1250     527

Upvotes: 0

Related Questions