Reputation: 24623
This is related to: R - view all the column names with any NA.
I compared data.frame and data.table versions and found that the data.table version is about 10x slower. This is contrary to most data.table code, which is usually much faster than the equivalent data.frame version.
set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA, 1:200), 1e4*5000, replace=TRUE), ncol=5000))

library(data.table)
library(microbenchmark)

# data.frame version: treat df1 as a list of columns and keep names of columns with any NA
f1 <- function() { names(df1)[sapply(df1, function(x) any(is.na(x)))] }
# data.table version: convert with setDT, then do the same via .SD in j
f2 <- function() { setDT(df1); names(df1)[df1[, sapply(.SD, function(x) any(is.na(x)))]] }

microbenchmark(f1(), f2(), unit="relative")
Unit: relative
 expr      min       lq   median       uq      max neval
 f1()  1.00000  1.00000 1.000000 1.000000 1.000000   100
 f2() 10.56342 10.20919 9.996129 9.967001 7.199539   100
With setDT applied beforehand:
set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA, 1:200), 1e4*5000, replace=TRUE), ncol=5000))
library(data.table)
setDT(df1)

library(microbenchmark)
f1 <- function() { names(df1)[sapply(df1, function(x) any(is.na(x)))] }
f2 <- function() { names(df1)[df1[, sapply(.SD, function(x) any(is.na(x)))]] }
microbenchmark(f1(), f2(), unit="relative")
Unit: relative
 expr      min       lq   median       uq      max neval
 f1()  1.00000  1.00000  1.00000  1.00000 1.000000   100
 f2() 10.64642 10.77769 10.79191 10.77536 7.716308   100
What could be the reason?
Upvotes: 3
Views: 1492
Reputation: 115515
data.table in this case will not provide any magical speedup.

For comparison, on my machine the timings are:

# Unit: relative
#  expr      min       lq   median       uq      max neval
#  f1() 1.000000 1.000000 1.000000 1.000000 1.000000    10
#  f2() 8.350364 8.146091 6.966839 5.766292 4.595742    10
In the "data.frame" approach, you are really just using the fact that a data.frame
is a list and iterating over the list.
In the data.table
approach you are doing the same thing, however by using .SD
, you are forcing the whole data.table to be copied (to make the data available). This is a consequence of data.table
being clever in only copying the data you need into the j
expression. By using .SD, you are copying everything in.
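As a hedged aside on that last point, assuming df1 has already been converted with setDT: naming a subset of columns via .SDcols means only those columns are materialised in j, rather than the whole table. The subset cols_to_check below is made up for the example.

cols_to_check <- names(df1)[1:10]    # illustrative subset of columns
df1[, sapply(.SD, function(x) any(is.na(x))), .SDcols = cols_to_check]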
The best approach to improving performance would be to use anyNA(), which is a faster (primitive) way of finding any NA values: it will stop once it has found the first NA, instead of creating the whole is.na() vector and then scanning it for any TRUE values.
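A rough illustration of that short-circuiting (the vector x below is made up for the example, with the NA placed right at the front; microbenchmark is loaded as above):

x <- c(NA, rnorm(1e7))                    # NA is the very first element
microbenchmark(anyNA(x), any(is.na(x)), times = 10)
# anyNA() can return as soon as it inspects the first element, while
# any(is.na(x)) has to build the full logical vector before scanning it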
For a more bespoke test you might need to write an (Rcpp sugar style) function.
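A minimal sketch of what such a function could look like (the name any_na_cpp and the integer-only signature are assumptions for illustration, not a drop-in replacement):

library(Rcpp)
cppFunction('
  bool any_na_cpp(IntegerVector x) {
    // Rcpp sugar: is_na() gives a logical vector, any() reduces it,
    // and is_true() converts the result to a plain C++ bool
    return is_true(any(is_na(x)));
  }')
any_na_cpp(df1[[1]])   # check a single (integer) column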
You will also find that unlist(lapply(...)) will generally be faster than sapply().
# anyNA() with unlist(lapply(...))
f3 <- function() names(df1)[unlist(lapply(df1, anyNA))]
# anyNA() with sapply()
f4 <- function() names(df1)[sapply(df1, anyNA)]
microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=10)
# Unit: relative
#  expr       min        lq    median        uq        max neval
#  f1() 10.988322 11.200684 11.048738 10.697663  13.110318    10
#  f2() 92.915256 92.000781 91.000729 88.421331 103.627198    10
#  f3()  1.000000  1.000000  1.000000  1.000000   1.000000    10
#  f4()  1.591301  1.663222  1.650136  1.652701   2.133943    10
And with Martin Morgan's suggestion (use.names=FALSE in unlist(), to avoid building names on the result):
f3.1 <- function() names(df1)[unlist(lapply(df1, anyNA), use.names=FALSE)]
microbenchmark(f1(), f2(), f3(), f3.1(), f4(), unit="relative", times=10)
# Unit: relative
#    expr        min         lq    median         uq        max neval
#    f1()  18.125295  17.902925  18.17514  18.410682  9.2177043    10
#    f2() 147.914282 145.805223 145.05835 143.630573 81.9495460    10
#    f3()   1.608688   1.623366   1.66078   1.648530  0.8257108    10
#  f3.1()   1.000000   1.000000   1.00000   1.000000  1.0000000    10
#    f4()   2.555962   2.553768   2.60892   2.646575  1.3510561    10
Upvotes: 9