Reputation: 1017
I am using the wonderful R data.table package. However, accessing (i.e. manipulating by reference) a column with a variable name is very clumsy: If we are given a data.table dt
which has two columns x and y and we want to add two columns and name it z then the command is
dt = dt[, z := x + y]
Now let us write a function add
that takes as arguments a (reference to a) data.table dt
and three column names summand1Name
, summand2Name
and resultName
and it is supossed to execute the exact same command as above only with general column names. The solution I am using right now is reflection, i.e.
add = function(dt, summand1Name, summand2Name, resultName) {
cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
eval(parse(text=cmd))
return(dt) # optional since manipulated by reference
}
However I am absolutely not satisfied with this solution. First of all it's clumsy, it does not make fun to code like this. It is hard to debug and it just pisses me off and burns time. Secondly, it is harder to read and understand. Here is my question:
Can we write this function in a somewhat nicer way?
I am aware of the fact that one can access columns with variable name like so: dt[[resultName]]
but when I write
dt[[resultName]] = dt[[summand1Name]] + dt[[summand2Name]]
then data.table starts to complain about having taken copies and not working by reference. I don't want that. Also I like the syntax dt = dt[<all 'database related operations'>]
so that everything I am doing is stuck together in one pair of brackets. Isn't it possible to make use of a special symbol like backticks or so in order to indicate that the name currently used is not referencing an actual column of the data table but rather is a placeholder for the name of an actual column?
Upvotes: 5
Views: 3003
Reputation: 33498
Using get()
:
add <- function(dt, summand1Name, summand2Name, resultName) {
dt[, (resultName) := get(summand1Name) + get(summand1Name)]
}
Using mget()
:
add2 <- function(dt, summand1Name, summand2Name, resultName) {
dt[, (resultName) := do.call(`+`, mget(c(summand1Name,summand2Name)))]
}
# Let
dt <- data.table(a = 1:5, b = 10:14)
# Then
add(dt, 'x', 'y', 'z')
dt[]
# x y z
# 1: 1 2 2
Upvotes: 1
Reputation: 360
new_add <- function(dt, summand1Name, summand2Name, resultName) {
dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)]
}
This just takes the column names as strings. Adding this to amatsuo_net's speed test, and adding sindri's two versions too, we get the following:
microbenchmark::microbenchmark(
original_add(dt, 'a', 'b', 'c'),
my_add(dt, 'a', 'b', 'c'),
list_access_add(dt, 'a', 'b', 'c'),
david_add(dt, 'a', 'b', 'c'),
new_add(dt, 'a', 'b', 'c'),
get_add(dt, 'a', 'b', 'c'),
mget_add(dt, 'a', 'b', 'c'))
## Unit: microseconds
## expr min lq mean median uq max neval
## original_add(dt, "a", "b", "c") 433.3 491.00 635.315 531.4 600.00 6064.0 100
## my_add(dt, "a", "b", "c") 978.0 1062.35 1310.808 1208.8 1357.80 4157.3 100
## list_access_add(dt, "a", "b", "c") 303.9 331.95 432.939 363.8 434.05 3361.6 100
## david_add(dt, "a", "b", "c") 401.3 440.65 659.748 474.5 577.75 11623.0 100
## new_add(dt, "a", "b", "c") 518.9 588.30 765.394 667.1 741.95 5636.5 100
## get_add(dt, "a", "b", "c") 415.1 454.50 674.699 491.1 546.70 9804.3 100
## mget_add(dt, "a", "b", "c") 425.4 474.65 596.165 533.2 590.75 3888.0 100
It's not the fastest among the versions, but if you're looking for code that's painless to write then this is pretty simple. Since it works off of rowSums
, it can also more easily be generalised to sum over an arbitrary number of columns at once.
Additionally, since dt
isn't mentioned inside the square brackets, you can add this column definition inside a data.table "pipe" instead of as a function, if you want to:
dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)
][, lapply(.SD, range), .SDcols = c(summand1Name, summand2Name, resultName)
][... # etc
]
Upvotes: 1
Reputation: 2448
You can combine the use of ()
on the LHS of :=
as well as with = FALSE
in referencing a variable on the RHS.
dt <- data.table(a = 1:5, b = 10:14)
my_add <- function(dt, summand1Name, summand2Name, resultName) {
dt[, (resultName) := dt[, summand1Name, with = FALSE] +
dt[, summand1Name, with = FALSE]]
}
my_add(dt, 'a', 'b', 'c')
dt
Edit:
Compared three versions. Mine is the most inefficient... (but will keep it just for reference).
set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))
original_add <- function(dt, summand1Name, summand2Name, resultName) {
cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
eval(parse(text=cmd))
return(dt) # optional since manipulated by reference
}
my_add <- function(dt, summand1Name, summand2Name, resultName) {
dt[, (resultName) := dt[, summand1Name, with = FALSE] +
dt[, summand1Name, with = FALSE]]
}
list_access_add <- function(dt, summand1Name, summand2Name, resultName) {
dt[, (resultName) := dt[[summand1Name]] + dt[[summand2Name]]]
}
david_add <- function(dt, summand1Name, summand2Name, resultName) {
dt[, (resultName) := .SD[[summand1Name]] + .SD[[summand2Name]]]
}
microbenchmark::microbenchmark(
original_add(dt, 'a', 'b', 'c'),
my_add(dt, 'a', 'b', 'c'),
list_access_add(dt, 'a', 'b', 'c'),
david_add(dt, 'a', 'b', 'c'))
## Unit: microseconds
## expr min lq mean median uq max
## original_add(dt, "a", "b", "c") 604.397 659.6395 784.2206 713.0315 776.1295 5070.541
## my_add(dt, "a", "b", "c") 1063.984 1168.6140 1460.5329 1247.7990 1486.9730 6134.959
## list_access_add(dt, "a", "b", "c") 272.822 310.9680 422.6424 334.3110 380.6885 3620.463
## david_add(dt, "a", "b", "c") 389.389 431.9080 542.7955 454.5335 493.4895 3696.992
## neval
## 100
## 100
## 100
## 100
Edit2:
With one million rows, the result looks like this. As expected the original method perform well as once eval
is done this will work fast.
## Unit: milliseconds
## expr min lq mean median uq max
## original_add(dt, "a", "b", "c") 2.493553 3.499039 6.585651 3.607101 4.390051 114.0612
## my_add(dt, "a", "b", "c") 11.821820 14.512878 28.387841 17.412433 19.642231 117.6359
## list_access_add(dt, "a", "b", "c") 2.161276 3.133110 6.874885 3.218185 3.407776 107.6853
## david_add(dt, "a", "b", "c") 2.237089 3.313133 6.047832 3.381757 3.788558 103.7532
## neval
## 100
## 100
## 100
## 100
Upvotes: 1
Reputation: 2535
Here's another solution using substitute
. I generally try to avoid using substitute
, but I think it's the only way of using fast data.table
and :=
code instead of native list access.
I kept to the interface of amatsuo_net.
set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))
snaut_add <- function(dt, summand1, summand2, resultName){
eval(substitute(
dt[, z := x + y],
list(
z=as.symbol(resultName),
x=as.symbol(summand1),
y=as.symbol(summand2)
)
))
}
snaut_add(dt, "a", "b", "c")
dt
Upvotes: 0