skan
skan

Reputation: 7720

data.table. Fast way to count number of changes within every column

I would like to know how many times each variable changes within each group and later add the result for all groups.

I've found this way:

mi[,lapply(.SD, function(x) sum(x != shift(x), 
  na.rm=T) ), by = ID][,-1][,lapply(.SD,sum, na.rm=T)]

It works, it produces the proper result but it's really slow in my large datatable. I would like to do both operations inside the same lapply (or something faster and more compact), but the first one is done by group, the second isn't.

It could be written in an easier way (maybe not always)

mi[,lapply(.SD, function(x) sum(x != shift(x), 
    na.rm=T) )] [,-1]-mi[,length(unique(ID))]+1

But it's still slow and needs a lot of memory.

Any other idea?

I've also tried with diffs instead of shift, but it becomes more difficult.

Here you have a dummy example:

mi <- data.table(ID=rep(1:3,each=4) , year=rep(1:4, times=3),
   VREP=rep(1:3,each=4) , VDI=rep(1:4, times=3), RAN=sample(12))
mi <- rbind(mi, data.table(4,1,1,1,0), use.names=F)

Big example for benchmark

mi <- as.data.table(matrix(sample(0:100,10000000,
 replace=T), nrow=100000, ncol=100))
mi[,ID := rep(1:1000,each=100)]

My problem is that the true dataset is much bigger, it's in the limit of memory size, then I've configured R to be able to use more memory using the pagefile, and it makes many operations slow. I know I could do it splitting the file and joining it again, but sometimes that makes things more difficult or some operations are not splittable.

Upvotes: 3

Views: 338

Answers (1)

eddi
eddi

Reputation: 49448

Your second method produces incorrect results, so is not a fair comparison point. Here's an optimized version of alexis_laz's suggestion instead:

setorder(mi, ID)
setDT(Map(`!=`, mi, shift(mi)))[,
  lapply(lapply(.SD, `&`, !ID), sum, na.rm = T), .SDcols = -"ID"]
#   year VREP VDI RAN
#1:    9    0   9   9

On your bigger sample:

setorder(mi, ID)
microbenchmark(method1(), alexis_laz(), eddi(), times = 5)
#Unit: milliseconds
#         expr       min        lq      mean    median        uq      max neval
#    method1() 7336.1830 7510.9543 7932.0476 8150.3197 8207.2181 8455.563     5
# alexis_laz() 1350.0338 1492.3793 1509.0790 1492.5426 1577.3318 1633.107     5
#       eddi()  400.3999  475.6908  494.5805  504.6163  524.2077  567.988     5

Upvotes: 2

Related Questions