Reputation: 185

Is there an optimal way to create a bunch of new columns with data.table?

I am using data.table I have a vector of the names for the new columns. I want to create these new columns but using the information from the old columns. Let me show it on a following example:

data <- data.table(a = c("OneA", "TwoB", "ThreeC"),
                   b = c(1, 2, 3))

newCols <- c("One", "Two", "Three")

for (newCol in newCols) {
  data[, eval(newCol) := gsub(paste0("^.*", newCol), "", a)]
}

As an output I expect (and get) the following:

> data
        a   b    One    Two Three
1:   OneA   1      A   OneA  OneA
2:   TwoB   2   TwoB      B  TwoB
3: ThreeC   3 ThreeC ThreeC     C

In this case I combine defining new columns from the vector and using the vector values themselves to fill these columns. Is there a way to do it more optimal (for example with a set())?

Upvotes: 4

Answers (3)

Jaap

Reputation: 83275

Actually, you where quite close. Just adapt your code to:

for (newCol in newCols) {
  DT[, (newCol) := sub(newCol, "", a)]
}

and you have a solution that is both fast and memory efficient (better than using lapply).

Alternatively you can also use set inside the for-loop:

for (newCol in newCols) {
  set(DT, j = newCol, value = sub(newCol, "", DT[["a"]]))
}

So why is a for-loop a better choice in this situation?

With lapply the RHS (right hand sight) of := is evaluated first. This means that all the new columns are created first and have to be allocated and populated in memory first and are then added to the data.table. The for-loop approach is more efficient because it does one column at a time and thus only needs working memory for that one column.
set is a low-overhead loop-able version of :=. It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop). set eliminates the small overhead of the [data.table-method and is as a result therefore faster.

Above explanation is based on this answer by Matt Dowle (creator of data.table) and the help-file ?set.

Upvotes: 4

ThomasIsCoding

Reputation: 102890

A base R solution of using Map()+cbind()

data <- as.data.table(c(data,`names<-`(Map(function(x) gsub(x,"",data$a),newCols),newCols)))

such that

> data
        a b    One    Two Three
1:   OneA 1      A   OneA  OneA
2:   TwoB 2   TwoB      B  TwoB
3: ThreeC 3 ThreeC ThreeC     C

Upvotes: 1

Joris C.

Reputation: 6244

One possible approach could be:

library(data.table)

DT[, (newCols) := lapply(newCols, function(x) sub(x, "", a))][]
#>         a b    One    Two Three
#> 1:   OneA 1      A   OneA  OneA
#> 2:   TwoB 2   TwoB      B  TwoB
#> 3: ThreeC 3 ThreeC ThreeC     C

Data

DT <- data.table(a = c("OneA", "TwoB", "ThreeC"), b = c(1, 2, 3))
newCols <- c("One", "Two", "Three")

Upvotes: 6

Is there an optimal way to create a bunch of new columns with data.table?

Answers (3)

Related Questions