Reputation: 24623
This is related to: R - view all the column names with any NA.
I compared data.frame and data.table versions and found that the data.table version is about 10x slower. This is contrary to most data.table code, which is usually much faster than the equivalent data.frame version.
set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA, 1:200), 1e4*5000, replace=TRUE), ncol=5000))

library(data.table)
library(microbenchmark)

# data.frame version: treat df1 as a list of columns and keep names of columns with any NA
f1 <- function() { names(df1)[sapply(df1, function(x) any(is.na(x)))] }
# data.table version: convert with setDT, then do the same via .SD in j
f2 <- function() { setDT(df1); names(df1)[df1[, sapply(.SD, function(x) any(is.na(x)))]] }

microbenchmark(f1(), f2(), unit="relative")
Unit: relative
 expr      min       lq   median       uq      max neval
 f1()  1.00000  1.00000 1.000000 1.000000 1.000000   100
 f2() 10.56342 10.20919 9.996129 9.967001 7.199539   100
With setDT applied beforehand:
set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA, 1:200), 1e4*5000, replace=TRUE), ncol=5000))
library(data.table)
setDT(df1)

library(microbenchmark)
f1 <- function() { names(df1)[sapply(df1, function(x) any(is.na(x)))] }
f2 <- function() { names(df1)[df1[, sapply(.SD, function(x) any(is.na(x)))]] }
microbenchmark(f1(), f2(), unit="relative")
Unit: relative
 expr      min       lq   median       uq      max neval
 f1()  1.00000  1.00000  1.00000  1.00000 1.000000   100
 f2() 10.64642 10.77769 10.79191 10.77536 7.716308   100
What could be the reason?
Upvotes: 3
Views: 1492
Reputation: 115515
data.table in this case will not provide any magical speedup.

For comparison, on my machine the timings are:

# Unit: relative
#  expr      min       lq   median       uq      max neval
#  f1() 1.000000 1.000000 1.000000 1.000000 1.000000    10
#  f2() 8.350364 8.146091 6.966839 5.766292 4.595742    10
In the "data.frame" approach, you are really just using the fact that a data.frame
is a list and iterating over the list.
In the data.table
approach you are doing the same thing, however by using .SD
, you are forcing the whole data.table to be copied (to make the data available). This is a consequence of data.table
being clever in only copying the data you need into the j
expression. By using .SD, you are copying everything in.
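As a hedged aside on that last point, assuming df1 has already been converted with setDT: naming a subset of columns via .SDcols means only those columns are materialised in j, rather than the whole table. The subset cols_to_check below is made up for the example.

cols_to_check <- names(df1)[1:10]    # illustrative subset of columns
df1[, sapply(.SD, function(x) any(is.na(x))), .SDcols = cols_to_check]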
The best approach to improving performance would be to use anyNA(), which is a faster (primitive) way of finding any NA values: it will stop once it has found the first NA, instead of creating the whole is.na() vector and then scanning it for any TRUE values.
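A rough illustration of that short-circuiting (the vector x below is made up for the example, with the NA placed right at the front; microbenchmark is loaded as above):

x <- c(NA, rnorm(1e7))                    # NA is the very first element
microbenchmark(anyNA(x), any(is.na(x)), times = 10)
# anyNA() can return as soon as it inspects the first element, while
# any(is.na(x)) has to build the full logical vector before scanning it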
For a more bespoke test you might need to write an (Rcpp sugar style) function.
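A minimal sketch of what such a function could look like (the name any_na_cpp and the integer-only signature are assumptions for illustration, not a drop-in replacement):

library(Rcpp)
cppFunction('
  bool any_na_cpp(IntegerVector x) {
    // Rcpp sugar: is_na() gives a logical vector, any() reduces it,
    // and is_true() converts the result to a plain C++ bool
    return is_true(any(is_na(x)));
  }')
any_na_cpp(df1[[1]])   # check a single (integer) column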
You will also find that unlist(lapply(...)) will generally be faster than sapply().
# anyNA() with unlist(lapply(...))
f3 <- function() names(df1)[unlist(lapply(df1, anyNA))]
# anyNA() with sapply()
f4 <- function() names(df1)[sapply(df1, anyNA)]
microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=10)
# Unit: relative
#  expr       min        lq    median        uq        max neval
#  f1() 10.988322 11.200684 11.048738 10.697663  13.110318    10
#  f2() 92.915256 92.000781 91.000729 88.421331 103.627198    10
#  f3()  1.000000  1.000000  1.000000  1.000000   1.000000    10
#  f4()  1.591301  1.663222  1.650136  1.652701   2.133943    10
And with Martin Morgan's suggestion (use.names=FALSE in unlist(), to avoid building names on the result):
f3.1 <- function() names(df1)[unlist(lapply(df1, anyNA), use.names=FALSE)]
microbenchmark(f1(), f2(), f3(), f3.1(), f4(), unit="relative", times=10)
# Unit: relative
#    expr        min         lq    median         uq        max neval
#    f1()  18.125295  17.902925  18.17514  18.410682  9.2177043    10
#    f2() 147.914282 145.805223 145.05835 143.630573 81.9495460    10
#    f3()   1.608688   1.623366   1.66078   1.648530  0.8257108    10
#  f3.1()   1.000000   1.000000   1.00000   1.000000  1.0000000    10
#    f4()   2.555962   2.553768   2.60892   2.646575  1.3510561    10
Upvotes: 9