Stephen Zander

Reputation: 93

Dynamically passing a variable to the `i` expression in data.table

SO #24833247 covers nearly all the use cases for passing column names dynamically to a data.table within a function. However, it misses one that I'm currently trying to address: passing variables to the `i` expression.

I'm trying to refactor some data cleansing code into a function that converts certain values to NA after I've pulled the data into a data.table.

For example, given the following:

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
dt[colA %in% c('~', ''), colA := NA]
dt[colB %in% c('~', ''), colB := NA]

I want a generic function that replaces the '~', '?' and '' values with NA, instead of having to explicitly code each transformation.

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
clearCol(dt, colA)
clearCol(dt, colB)

The j expression is straightforward:

clearCol <- function(dt, f) {
  f = substitute(f)
  dt[,(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:   NA
2:   NA
3:   NA
4:   NA
5:   NA

However, extending it to add the variable to the i expression fails:

clearCol <- function(dt, f) {
  f = substitute(f)
  dt[(f) %in% c('~', ''),(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
 Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
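(As far as I can tell, this fails because `substitute(f)` returns an unevaluated symbol, so `(f)` in `i` evaluates to that language object rather than to the column it names, and `%in%` then hands it to `match()`, which only accepts vectors. A minimal reproduction outside of data.table; the variable name is just for illustration:

```r
f <- quote(colA)         # what substitute(f) returns inside the function
class(f)                 # "name" -- a symbol, not a character vector
# %in% is defined as match(x, table, nomatch = 0L) > 0L, and match()
# refuses language objects, reproducing the error above:
try(f %in% c("~", ""))   # Error ... 'match' requires vector arguments
```
)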

Swapping to this seems to work, but the lack of output with verbose = TRUE (compared to the hard-coded method at the top) leaves me concerned that it will not scale well to the large data sets I'm working with:

clearCol <- function(dt, f) {
  f = deparse(substitute(f))
  dt[get(f) %in% c('~', ''),(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:    A
2:    b
3:   NA
4:    d
5:   NA

Is there another way of doing what I want?

Upvotes: 0

Views: 304

Answers (1)

Frank

Reputation: 66819

You can follow the data.table FAQ, section 1.6, to get the verbose output:

cc = function(d, col, vs = c("~", ""), verb = FALSE){
  col = substitute(col)
  ix  = substitute(col %in% vs)
  d[eval(ix), as.character(col) := NA, verbose = verb ][]
}

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
cc(dt, colA, verb = TRUE)

which gives

Creating new index 'colA'
Starting bmerge ...done in 0 secs
Detected that j uses these columns: <none> 
Assigning to 2 row subset of 5 rows
Dropping index 'colA' due to update on 'colA' (column 1)
   colA colB
1:    A     
2:    b    ?
3:   NA   a1
4:    d   a2
5:   NA   z4

However, notice what the verbose output is saying here. It's creating a new index (assuming you didn't do something to create it already, which seems likely since the data was only just read in)... and then it's dropping that index, since the update to the column invalidates it. That hardly sounds like something that would contribute to efficiency.

If you really want to do this efficiently, there are a couple of options:

  1. Use na.strings when reading the data in
  2. Use set if you have a ton of columns and somehow can't do #1
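A minimal sketch of both options (the file name, function name and default sentinel values here are illustrative, not from the question):

```r
library(data.table)

# Option 1: let fread() convert the sentinel strings to NA at read time.
# dt <- fread("data.csv", na.strings = c("~", "?", ""))

# Option 2: loop over columns with set(), which avoids the overhead of
# [.data.table and never builds (or drops) an index.
clearCols <- function(dt, cols = names(dt), vs = c("~", "?", "")) {
  for (col in cols) {
    rows <- which(dt[[col]] %in% vs)   # row positions holding sentinel values
    if (length(rows)) set(dt, i = rows, j = col, value = NA)
  }
  dt[]
}

dt <- data.table(colA = c('A', 'b', '~', 'd', ''),
                 colB = c('', '?', 'a1', 'a2', 'z4'))
clearCols(dt)
```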

Upvotes: 1
