Jan van der Laan

Reputation: 8105

Create variable number of columns in data.table using function

I have a function that, given an input vector, returns a data.frame or data.table; the number and names of the columns depend on the input. I want to add these columns to an existing data.table, using one of its columns as input for the function. What is the easiest/cleanest way of doing this in data.table?

library(data.table)

# Example function; in this case the number of columns the function
# returns is fixed, but in practice the number of columns and the
# names of the columns depend on x
my_function <- function(x) {
  name <- deparse1(substitute(x))
  res <- data.table(x == 1, x == 2)
  names(res) <- paste0(name, "==", 1:2)
  res
}

# Example data set
dta <- data.table(a = sample(1:10, 10, replace = TRUE), b = letters[1:10])

I can create new columns using this function:

> dta[, my_function(a)]
     a==1  a==2
 1: FALSE FALSE
 2: FALSE FALSE
 3: FALSE FALSE
 4: FALSE FALSE
 5: FALSE FALSE
 6:  TRUE FALSE
 7: FALSE FALSE
 8:  TRUE FALSE
 9: FALSE  TRUE
10:  TRUE FALSE

However, I also want to keep the existing columns. The following does what I want, but I expect there is a simpler/better solution. I also expect that cbind makes a copy of the data, which is another reason to avoid it, as the data sets are quite large.

> dta <- cbind(dta, dta[, my_function(a)])
> dta
     a b  a==1  a==2
 1:  1 a  TRUE FALSE
 2:  8 b FALSE FALSE
 3:  2 c FALSE  TRUE
 4:  4 d FALSE FALSE
 5: 10 e FALSE FALSE
 6:  4 f FALSE FALSE
 7:  8 g FALSE FALSE
 8: 10 h FALSE FALSE
 9:  8 i FALSE FALSE
10:  4 j FALSE FALSE

Upvotes: 1

Views: 316

Answers (1)

Ronak Shah

Reputation: 388982

Here is one way that avoids copying the original data.table object:

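library(data.table)
# Create a temporary object holding the new columns
tmp <- dta[, my_function(a)]
# Create column names (use names(tmp) instead to keep the names
# returned by my_function)
cols <- paste0('cols', seq_along(tmp))
# Add the columns of the temporary object by reference under the new names
dta[, (cols) := tmp]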
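
The same idea can be wrapped in a small helper. The sketch below is only an illustration: add_columns is a hypothetical name (it is not part of data.table), and the four-row data set is made up. It uses set() to add each column by reference under the name the function returned, and compares address() before and after as a rough check that the data.table object itself was not replaced.

```r
library(data.table)

# Example function from the question (needs R >= 4.0 for deparse1)
my_function <- function(x) {
  name <- deparse1(substitute(x))
  res <- data.table(x == 1, x == 2)
  setnames(res, paste0(name, "==", 1:2))
  res
}

# Hypothetical helper (not part of data.table): add every column of
# new_cols to dt by reference, keeping the names new_cols already has
add_columns <- function(dt, new_cols) {
  stopifnot(is.data.table(dt), nrow(new_cols) == nrow(dt))
  for (nm in names(new_cols)) {
    set(dt, j = nm, value = new_cols[[nm]])
  }
  invisible(dt)
}

# Small made-up data set for illustration
dta <- data.table(a = c(1L, 8L, 2L, 4L), b = letters[1:4])

addr_before <- address(dta)
add_columns(dta, dta[, my_function(a)])

print(names(dta))
# Rough check that dta is still the same object in memory
print(identical(address(dta), addr_before))
```

Because set() and := both modify dta in place, no assignment back to dta is needed. The temporary result of my_function still has to be allocated, so the saving is the copy of the existing columns that cbind would make.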

Benchmark added by OP

Below is the script I used to benchmark the solutions:

library(data.table)
my_function <- function(x) {
  name <- deparse1(substitute(x))
  res <- data.table(x == 1, x == 2)
  names(res) <- paste0(name, "==", 1:2)
  res
}
set.seed(1)
N <- 2E7
x <- sample(1:10, N, replace = TRUE)
dta <- data.table()
dta[, (letters[1:24]) := x]

t <- system.time({
  tmp <- dta[, my_function(a)]
  cols <- names(tmp)
  dta[, (cols) := tmp]
})
#t <- system.time({
#  dta <- cbind(dta, dta[, my_function(a)])
#})
print(t)

The script was run under Linux (Ubuntu 20.04) using /bin/time -v Rscript bench.R. /bin/time reports the peak memory use in the field Maximum resident set size (kbytes).

For the cbind solution the reported user time was 1.362 seconds and max memory 4206072 kbytes.

For the solution above the reported user time was 0.339 seconds and max memory 2486996 kbytes.

The solution above is therefore faster and uses less memory than the cbind version.

Upvotes: 1
