Reputation: 1069
data.table
is a fantastic R package and I am using it in a library I am developing. So far all is going very well, except for one complication. It seems to be much more difficult (compared to the conventional data frames) to refer to data.table
columns using names saved in variables (as for data frames would be, for example: colname="col"; df[df[,colname]<5,colname]=0
).
Perhaps what complicates the things most is the apparent lack of consistency of syntax on this in data.table
. In some cases, eval(colname)
and get(colname)
, or even c(colname)
seem to work. In others, DT[,colname, with=F]
is the solution. Yet in others, such as, for example, the set()
and subset()
functions, I haven't found a solution at all. Finally, an extreme, albeit also quite common use case was discussed earlier (passing column names to data.table programmatically) and the proposed solutions, albeit apparently doing their job, did not seem particularly readable...
Perhaps I am complicating things too much? If anyone could jot down a quick cheatsheet for referring to data.table
column names using variables for different common scenarios, I would be very grateful.
UPDATE:
Some specific examples that work provided I can hard code column names:
x.short = subset(x, abs(dist)<=100)
set(x, which(x$val<10), "val", 0)
Now assume distcol="dist"
, valcol="val"
. What is the best way to do the above using distcol
and valcol
, but not dist
and val
?
Upvotes: 47
Views: 33384
Reputation: 165
Another neat solution is to rename your column to something fixed, do your operations on the fixed name column, and rename back. Avoids all the unreadable and unrememberable code.
setnames(dt, colname, "fixed_")
dt[,fixed_:=paste0("foo-",fixed_)]
setnames(dt, "fixed_", colname)
Upvotes: 3
Reputation: 325
eval
is definitely not a recommended approach to subset a data.table
using dynamically saved variables. The following example will help:
# Toy data.table example
DT = data.table(a = c(1,2,3), b = c(4,5,6))
# Saved variable
mVar <- "a"
# Subset
DT[DT[[mVar]] < 2]
eval
is very sensitive to complex character expressions and generally not recommended for production code.
Upvotes: 1
Reputation: 49448
If you are going to be doing complicated operations inside your j
expressions, you should probably use eval
and quote
. One problem with that in current version of data.table
is that the environment of eval
is not always correctly processed - eval and quote in data.table (Note: There has been an update to that answer based on an update to the package.) - and the current fix for that is to add .SD
to eval
. As far as I can tell from a few tests that I've run this doesn't affect speed (the way e.g. having .SD[1]
in j
would).
Interestingly this issue only plagues the j
and you'll be fine using eval
normally in i
(where .SD
is not available anyway).
The other problem is assignment, and there you have to have strings. I know one way to extract the string name from a quoted expression - it's not pretty, but it works. Here's an example combining everything together:
x = data.table(dist = c(1:10), val = c(1:10))
distcol = quote(dist)
valcol = quote(val)
x[eval(valcol) < 5,
capture.output(str(distcol, give.head = F)) := eval(distcol)*sum(eval(distcol, .SD))]
Note how I was ok not adding .SD
in one eval(distcol)
, but won't be if I take it out of the other eval
.
Another option is to use get
:
diststr = "dist"
valstr = "val"
x[get(valstr) < 5, c(diststr) := get(diststr)*sum(get(diststr))]
Upvotes: 29
Reputation: 498
Say you have the column name in variable x
, you could do
colname = as.name(x)
you can then use colname
in the subset
function
Upvotes: 4
Reputation: 66819
Maybe you know about this solution already?
DT[[colname]]
This is inspired by @eddi's solution in the comments below, using the OP's example:
set.seed(1)
x = data.table(a = 1:10, b=rnorm(10))
colstr="b"
col <- eval(parse(text=paste("quote(",colstr,")",sep="")))
x[eval(col)<0]
x[eval(col)<0,c(colstr):=-100]
Upvotes: 11