Peter Chen

Reputation: 1484

Replacing "" with NA in data.table using .SDcols

I have a question about using .SDcols in data.table to change values across several columns.
Here is the example data:

dt
A   B   C   D
XX  XY  ""  ""
ZZ  ZA  ""  ""

I want to use .SDcols to change "" to NA.

I tried this:

dt[.SD == "", lapply(.SD, is.na), .SDcols = .(A, B, C, D)]  

However, I got an error.

Any help would be appreciated.

Upvotes: 1

Views: 1157

Answers (1)

chinsoon12

Reputation: 25225

Using the more robust method from Frank's comments (which handles cases with no NAs), here are some timings for reference.

library(data.table)
library(microbenchmark)

set.seed(6L)
N <- 1e7
numCols <- 100
pctEmpty <- 0.25
ltrs <- sample(LETTERS, N, replace=TRUE)
ltrs[sample(N, pctEmpty*N)] <- ""
dt <- as.data.table(matrix(ltrs, ncol=numCols))

str(dt)
dt1 <- copy(dt)
dt2 <- copy(dt)

microbenchmark(
    Replace = dt1[, (names(dt1)) := lapply(.SD, function(x) replace(x, x == "", NA_character_)), .SDcols = names(dt1)],
    Assign = dt2[, (names(dt2)) := lapply(.SD, function(x) { is.na(x) <- x == ""; x }), .SDcols = names(dt2)],
    times = 10L)

# Unit: milliseconds
#     expr      min       lq     mean   median       uq      max neval
#  Replace 234.0141 240.0262 311.2857 268.2718 401.9364 410.1788    10
#   Assign 273.1776 276.4123 344.1861 295.1337 435.8436 449.6495    10

Assign takes about 10% longer than Replace at the median, so the difference in timings is negligible. You can adjust the parameters (N, numCols, pctEmpty) to see how the tradeoff changes for your data.
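For the small table in the question, the Replace variant could look roughly like the sketch below. It assumes the four columns are character, and the helper name `cols` is only for illustration. Note that .SDcols is given column names as a character vector here rather than .(A, B, C, D), and the condition is applied per column inside lapply instead of in i.

```r
library(data.table)

# Small table mirroring the question's example (columns assumed to be character)
dt <- data.table(A = c("XX", "ZZ"), B = c("XY", "ZA"), C = c("", ""), D = c("", ""))

# Hypothetical helper: the columns to process, as a character vector
cols <- c("A", "B", "C", "D")

# Replace empty strings with NA by reference in the selected columns
dt[, (cols) := lapply(.SD, function(x) replace(x, x == "", NA_character_)), .SDcols = cols]

print(dt)
```

Since := updates by reference, dt itself is modified and no reassignment is needed.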

Upvotes: 2
