Reputation: 85
I have to deal with huge data sets, that is why I have to use data.table package. I would like to pick only rows which have appropriate values in given column.
dt <- data.table(a = rep(c("A", "B", "C"), 3),
b = 1:9)
n <- c("A", "C")
What I can do:
dt[ a %in% n]
a b
1: A 1
2: C 3
3: A 4
4: C 6
5: A 7
6: C 9
What if I do not know name of column before and get it from function as string? I tried:
dt[ "a" %in% n]
Empty data.table (0 rows) of 2 cols: a,b
dt[ "a" %in% n, with = F]
Error in `[.data.table`(dt, "a" %in% n, with = F) :
j must be provided when with=FALSE
dt[ as.name("a") %in% n ]
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
Question - is it possible to use string in such a task?
Upvotes: 2
Views: 116
Reputation: 92300
Could achieve this in various ways
Either using joins (which will return an ordered result, without sorting dt
in place)
dt[.(n), on = "a"]
Or (which will also return an ordered result while sorting dt
in place)
setkeyv(dt, "a")[.(n)]
Or using eval
/as.name
(which will return an unsorted result)
dt[eval(as.name("a")) %in% n]
Or using a list subset (which will return an unsorted result)
dt[dt[["a"]] %in% n]
A few points you might want to take consideration of.
in v 1.9.4 data.table package introduced secondary indexes. This means that when you do a vector scan (==
or %in%
) it will create a some sort or a key (similarly as it would if you will use setkey
explicitly) which will make the first run a bit slower but will significantly increase performance the next time you will search for matches in this column.
Though, not in all cases the secondary key will be (permanently) set. In some cases it will be always a simple vector scan (e.g. dt[eval(as.name("a")) %in% n, verbose = TRUE]
or dt[get('a') %in% n]
or dt[dt[["a"]] %in% n, verbose = TRUE]
) which will be faster if you only run this once. In some case an ad-hoc index will be created within each run (e.g. dt[n, on = "a", verbose = TRUE]
, but will also use an index if it already exists) which is sub-optimal, while in some cases a permanent key or a secondary key will be set (e.g. setkeyv(dt, "a")[.(n), verbose = TRUE]
or as proposed by @Frank dt[eval(substitute(col %in% n, list(col=as.name("a")))), verbose = TRUE]
)
Upvotes: 1
Reputation: 613
Use get
function
dt[get('a') %in% n]
It also works if colname is stored in another variable:
temp <- 'a'
dt[get(temp) %in% n]
Upvotes: 0