SO #24833247 covers nearly all the use cases for passing column names dynamically to a data.table within a function. However, it misses one I'm currently trying to address: passing variables to the i expression.
I'm trying to refactor some data-cleansing code into a function that converts certain values to NA after I've pulled the data into a data.table.
For example, given the following:
dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
dt[colA %in% c('~', ''), colA := NA]
dt[colB %in% c('~', ''), colB := NA]
I want a generic function that replaces the '~', '?', and '' values with NA, instead of having to explicitly code each transformation:
dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
clearCol(dt, colA)
clearCol(dt, colB)
The j expression is straightforward:
clearCol <- function(dt, f) {
  f = substitute(f)
  dt[, (f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:   NA
2:   NA
3:   NA
4:   NA
5:   NA
However, extending it to add the variable to the i expression fails:
clearCol <- function(dt, f) {
  f = substitute(f)
  dt[(f) %in% c('~', ''), (f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
Swapping to this seems to work, but the lack of output with verbose = TRUE (compared to the hard-coded method at the top) leaves me concerned that it will not scale well when given the large data sets I'm working with:
clearCol <- function(dt, f) {
  f = deparse(substitute(f))
  dt[get(f) %in% c('~', ''), (f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:    A
2:    b
3:   NA
4:    d
5:   NA
Is there another way of doing what I want?
You can follow FAQ 1.6 to get the verbose output:
cc = function(d, col, vs = c("~", ""), verb = FALSE){
  col = substitute(col)
  ix = substitute(col %in% vs)
  d[eval(ix), as.character(col) := NA, verbose = verb][]
}
dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
cc(dt, colA, verb = TRUE)
which gives
Creating new index 'colA'
Starting bmerge ...done in 0 secs
Detected that j uses these columns: <none>
Assigning to 2 row subset of 5 rows
Dropping index 'colA' due to update on 'colA' (column 1)
   colA colB
1:    A
2:    b    ?
3:   NA   a1
4:    d   a2
5:   NA   z4
However, notice what the verbose output is saying here. It's creating an index (assuming you didn't do something to create it already, which seems likely since the data was only just read in)... and then it's removing that index (since it is invalidated by the edit to the column). That hardly sounds like something that would do much to contribute to efficiency.
If you really want to do this efficiently, there are a couple of options:
1. na.strings when reading the data in
2. set if you have a ton of columns and somehow can't do #1
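A minimal sketch of both options (the file name in option 1 is illustrative; na.strings is a real fread argument, and set() is data.table's low-overhead assignment by reference):

```r
library(data.table)

# Option 1: declare the junk values as NA at read time
# (hypothetical file name, shown for illustration only)
# dt <- fread("mydata.csv", na.strings = c("~", "?", ""))

# Option 2: loop over columns with set(), which avoids the
# overhead of repeated [.data.table calls
dt <- data.table(colA = c('A', 'b', '~', 'd', ''),
                 colB = c('', '?', 'a1', 'a2', 'z4'))
for (j in seq_along(dt)) {
  rows <- which(dt[[j]] %in% c('~', '?', ''))
  if (length(rows)) set(dt, i = rows, j = j, value = NA)
}
dt[]
```

Because set() assigns by reference and takes a plain integer row index, no index is created or dropped, so the concern about the verbose output above does not arise.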