statquant

Reputation: 14400

How can I speed up this row-by-row operation in data.table

I have a data.table with on the order of 1e5 rows and approximately 100 columns. For each row, I want to find the indices of the first 3 columns whose value is neither NA nor 0.

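library(data.table)

# 1e5 x 10 integer matrix, initially all NA
m <- matrix(rep(NA_integer_, 1e6), ncol=10)
for(i in 1:nrow(m)){
    set.seed(i)
    # put the values 1..5 into 5 randomly chosen columns of row i
    m[i, sample(1:10, 5)] <- 1L:5L
}
DT <- data.table(m)
DT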
        V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
     1: NA  5  1  2  3 NA  4 NA NA  NA
     2: NA  1 NA NA  3  5  2 NA NA   4
     3: NA  1  4  3 NA NA NA  2  5  NA
     4:  2  4  3 NA  5  1 NA NA NA  NA
     5:  5  4  1 NA NA NA  2  3 NA  NA
    ---                               
 99996: NA NA  2  3  5  1 NA NA  4  NA
 99997:  2 NA NA NA  1 NA NA  3  5   4
 99998:  5 NA  4  2 NA  1  3 NA NA  NA
 99999: NA  5 NA  1 NA  4 NA  2 NA   3
100000:  5 NA NA NA  2  3  1 NA NA   4

f <- function(x){return(list(which(!is.na(x) & x!=0L)[1:3L]))}

# Here is what apply does
system.time(test <- apply(m, FUN=f, MARGIN=1))
   user  system elapsed 
   1.30    0.00    1.29

I find this very slow. It might not be a task for data.table at all; I am looking for a fast way of getting this answer, and any method is welcome.

Upvotes: 3

Views: 360

Answers (1)

Arun

Reputation: 118889

First, you can use the fact that 0/0 is NaN, for which is.na also returns TRUE (NA/0 stays NA, and any non-zero value divided by 0 is Inf or -Inf). This reduces the condition to a single !is.na. Second, you can vectorise using which with arr.ind = TRUE, which gives the row and column index of every match. We can then group by row and take the first three col values as follows:

system.time(tt <- data.table(which(!is.na(DT[, lapply(.SD, function(x) x/0)]), 
             arr.ind=TRUE), key="row")[, col[1:3], by="row"])
   user  system elapsed
  0.360   0.000   0.359
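
To see why this works, here is a minimal base-R sketch on toy data. The vector x and the matrix m_small are made up for illustration and are not from the question. It shows what dividing by zero does to each kind of value, and what which(..., arr.ind = TRUE) returns.

```r
# Toy vector: one NA, one zero, two non-zero values (illustrative only)
x <- c(NA, 0L, 3L, 7L)
x / 0                  # NA stays NA, 0/0 is NaN, non-zero values become Inf
keep <- !is.na(x / 0)  # TRUE only for the non-NA, non-zero entries

# Toy 2 x 3 matrix (hypothetical data)
m_small <- matrix(c(NA, 0L, 2L,
                    4L, NA, 1L), nrow = 2, byrow = TRUE)

# One (row, col) pair per valid cell, listed in column-major order
idx <- which(!is.na(m_small / 0), arr.ind = TRUE)

# Group the column indices by row; within a row they are already ascending
first_cols <- split(idx[, "col"], idx[, "row"])
```

Because which walks the matrix column by column, the col values within each row come out in increasing order, which is what makes col[1:3] the first three valid columns.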

Edit: an alternative way:

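# Overwrite DT with a logical table: TRUE where the value is neither NA nor 0
DT <- DT[, lapply(.SD, function(x) !is.na(x/0))]
# One output column per requested index; 0 means "slot not filled yet"
out <- data.table(matrix(numeric(3e5), ncol=3))
system.time({    
for (i in as.integer(seq_along(DT))) {
    for (j in 1:3) {
        # rows where column i is still unused and output slot j is still empty
        zeros <- .subset2(DT, i) & (out[[j]] == 0)
        out[zeros, names(out)[j] := i]
        # mark column i as used so it is not written into the next slot too
        DT[zeros, c(names(DT)[i]) := FALSE]
    }
}
})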

Not sure if it's the fastest though.
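
For comparison, here is a minimal base-R sketch of the same idea of sweeping over the (few) columns instead of the (many) rows. It is a variant written on a plain matrix, not the code above: first_k_cols, valid and filled are hypothetical names, and unfilled slots are left as NA rather than 0. I have not timed it.

```r
# Return, for each row of m, the indices of the first k columns
# whose value is neither NA nor 0 (NA where fewer than k exist).
first_k_cols <- function(m, k = 3L) {
  valid  <- !is.na(m / 0)                    # TRUE for non-NA, non-zero cells
  out    <- matrix(NA_integer_, nrow(m), k)  # one row of results per input row
  filled <- integer(nrow(m))                 # number of slots used so far, per row
  for (j in seq_len(ncol(m))) {
    # rows where column j is valid and there is still a free slot
    hit <- which(valid[, j] & filled < k)
    if (length(hit) == 0L) next
    filled[hit] <- filled[hit] + 1L
    out[cbind(hit, filled[hit])] <- j        # write j into the next free slot
  }
  out
}

# Toy input (hypothetical data)
m_small <- matrix(c(NA, 0L, 2L,
                    4L, NA, 1L), nrow = 2, byrow = TRUE)
res <- first_k_cols(m_small)
```

The loop runs once per column, so its cost grows with the number of columns rather than the number of rows; whether that beats the which-based version on real data would need to be measured.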

Upvotes: 4
