Selection by row if column name is a string

Question

I have to deal with huge data sets, that is why I have to use data.table package. I would like to pick only rows which have appropriate values in given column.

dt <- data.table(a = rep(c("A", "B", "C"), 3), 
                 b = 1:9)
n <- c("A", "C")

What I can do:

dt[ a %in% n]
   a b
1: A 1
2: C 3
3: A 4
4: C 6
5: A 7
6: C 9

What if I do not know name of column before and get it from function as string? I tried:

dt[ "a" %in% n]
Empty data.table (0 rows) of 2 cols: a,b

dt[ "a" %in% n, with  = F]
Error in `[.data.table`(dt, "a" %in% n, with = F) : 
  j must be provided when with=FALSE

dt[ as.name("a") %in% n ]
Error in match(x, table, nomatch = 0L) : 
  'match' requires vector arguments

Question - is it possible to use string in such a task?

David Arenburg · Accepted Answer

Could achieve this in various ways

Either using joins (which will return an ordered result, without sorting dt in place)

dt[.(n), on = "a"]

Or (which will also return an ordered result while sorting dt in place)

setkeyv(dt, "a")[.(n)]

Or using eval/as.name (which will return an unsorted result)

dt[eval(as.name("a")) %in% n]

Or using a list subset (which will return an unsorted result)

dt[dt[["a"]] %in% n]

A few points you might want to take consideration of.

in v 1.9.4 data.table package introduced secondary indexes. This means that when you do a vector scan (== or %in%) it will create a some sort or a key (similarly as it would if you will use setkey explicitly) which will make the first run a bit slower but will significantly increase performance the next time you will search for matches in this column.

Though, not in all cases the secondary key will be (permanently) set. In some cases it will be always a simple vector scan (e.g. dt[eval(as.name("a")) %in% n, verbose = TRUE] or dt[get('a') %in% n] or dt[dt[["a"]] %in% n, verbose = TRUE]) which will be faster if you only run this once. In some case an ad-hoc index will be created within each run (e.g. dt[n, on = "a", verbose = TRUE], but will also use an index if it already exists) which is sub-optimal, while in some cases a permanent key or a secondary key will be set (e.g. setkeyv(dt, "a")[.(n), verbose = TRUE] or as proposed by @Frank dt[eval(substitute(col %in% n, list(col=as.name("a")))), verbose = TRUE])

Selection by row if column name is a string

Answers (2)

Related Questions