Reputation: 8105
I have a function that given an input vector returns a data.frame or data.table; the number of columns and the names of the columns depend on the input. I want to add these columns to an existing data.table using one of the columns of the data.table as input for the function. What is the easiest/cleanest way of doing this in a data.table?
# Example function; in this case the number of columns the function
# returns is fixed, but in practice the number of columns and the
# names of the columns depend on x
my_function <- function(x) {
  name <- deparse1(substitute(x))
  res <- data.table(x == 1, x == 2)
  names(res) <- paste0(name, "==", 1:2)
  res
}
# Example data set
dta <- data.table(a = sample(1:10, 10, replace = TRUE), b = letters[1:10])
I can create new columns using this function:
> dta[, my_function(a)]
a==1 a==2
1: FALSE FALSE
2: FALSE FALSE
3: FALSE FALSE
4: FALSE FALSE
5: FALSE FALSE
6: TRUE FALSE
7: FALSE FALSE
8: TRUE FALSE
9: FALSE TRUE
10: TRUE FALSE
However, I also want to keep the existing columns. The following does what I want, but I expect there is a simpler/better solution. I also expect that the cbind will introduce a copy of the data, which is another reason I want to avoid it, as the data sets are quite large.
> dta <- cbind(dta, dta[, my_function(a)])
> dta
a b a==1 a==2
1: 1 a TRUE FALSE
2: 8 b FALSE FALSE
3: 2 c FALSE TRUE
4: 4 d FALSE FALSE
5: 10 e FALSE FALSE
6: 4 f FALSE FALSE
7: 8 g FALSE FALSE
8: 10 h FALSE FALSE
9: 8 i FALSE FALSE
10: 4 j FALSE FALSE
Upvotes: 1
Views: 316
Reputation: 388982
Here is one way which avoids copying the original data.table object:
library(data.table)
#Create a temporary object
tmp <- dta[,my_function(a)]
#Create column names
cols <- paste0('cols', seq_along(tmp))
#Add the temporary object with new column names
dta[, (cols) := tmp]
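The names `cols1`, `cols2` above are generic placeholders. If the names returned by the function should be kept, as the question asks, the same pattern appears to work with `names(tmp)` on the left-hand side. A minimal sketch, assuming `my_function` from the question and a small toy data set (the three-row `dta` here is only for illustration):

```r
library(data.table)

# Function from the question: column names depend on the argument expression
my_function <- function(x) {
  name <- deparse1(substitute(x))
  res <- data.table(x == 1, x == 2)
  names(res) <- paste0(name, "==", 1:2)
  res
}

# Toy data (illustrative only)
dta <- data.table(a = c(1L, 2L, 3L), b = letters[1:3])

# Compute the new columns once, then add them by reference,
# reusing the names the function produced
tmp <- dta[, my_function(a)]
dta[, (names(tmp)) := tmp]

print(names(dta))
```

Because the names are non-syntactic (`a==1`), they need backticks or `[[` when referenced later, for example ``dta[["a==1"]]``.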
Below is the script I used to benchmark the solutions:
library(data.table)
my_function <- function(x) {
  name <- deparse1(substitute(x))
  res <- data.table(x == 1, x == 2)
  names(res) <- paste0(name, "==", 1:2)
  res
}
set.seed(1)
N <- 2E7
x <- sample(1:10, N, replace = TRUE)
dta <- data.table()
dta[, (letters[1:24]) := x]
t <- system.time({
  tmp <- dta[, my_function(a)]
  cols <- names(tmp)
  dta[, (cols) := tmp]
})
#t <- system.time({
#  dta <- cbind(dta, dta[, my_function(a)])
#})
print(t)
The command was run under Linux (Ubuntu 20.04) using /bin/time -v Rscript bench.R. time reports max memory use in the field Maximum resident set size (kbytes).
For the cbind solution the reported user time was 1.362 seconds and max memory 4206072 kbytes.
For the solution above the reported user time was 0.339 seconds and max memory 2486996 kbytes.
The solution above is therefore faster and uses less memory than the cbind version.
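A quicker sanity check than a full benchmark is to compare object addresses with `data.table::address()`. This is only a sketch on toy data, and the column names `flag1` and `flag2` are hypothetical: `cbind` hands back a new data.table object, while `:=` adds the columns to the existing one.

```r
library(data.table)

# Toy data (illustrative only, not the benchmark data)
dta <- data.table(a = c(1L, 2L, 3L), b = letters[1:3])

# New columns to attach; the names are hypothetical
new_cols <- data.table(flag1 = dta$a == 1, flag2 = dta$a == 2)

addr_before <- address(dta)

# cbind returns a separate data.table object
bound <- cbind(dta, new_cols)
cbind_is_new_object <- address(bound) != addr_before

# := adds the columns to the same object, so the address stays the same
dta[, (names(new_cols)) := new_cols]
set_keeps_object <- address(dta) == addr_before

print(c(cbind_is_new_object = cbind_is_new_object,
        set_keeps_object = set_keeps_object))
```

An unchanged address only shows that the table object itself was not replaced; the memory figures from `/bin/time -v` remain the better measure of total cost.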
Upvotes: 1