Reputation: 397

cbind specific columns from multiple data.tables efficiently

I have a list of data.tables that I need to cbind; however, I only need the last X columns.

My data is structured as follows:

DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))

DT.list <- list(DT.1, DT.2, DT.3)
> DT.list
[[1]]
   x y v1 v2
1: 1 a  1  3
2: 1 a  2  4

[[2]]
   x y v3
1: 1 a  5
2: 1 a  6

[[3]]
   x y v4 v5 v6
1: 1 a  7  9 11
2: 1 a  8 10 12

Columns x and y are the same in each of the data.tables, but the number of remaining columns differs. The output should not include duplicate x and y columns. It should look as follows:

   x y v1 v2 v3 v4 v5 v6
1: 1 a  1  3  5  7  9 11
2: 1 a  2  4  6  8 10 12

I want to avoid using a loop. I am able to bind the data.tables using do.call("cbind", DT.list) and then remove the duplicates manually, but is there a way where the duplicates aren't created in the first place? Also, efficiency is important since the lists can be long with large data.tables.
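For reference, the bind-then-deduplicate workaround described above might look like the following minimal sketch. The names `wide` and `keep` are illustrative, and dropping repeated names by position via `duplicated()` is only one assumed way of doing the manual cleanup.

```r
library(data.table)

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.list <- list(DT.1, DT.2)

# cbind everything; x and y now appear once per list element
wide <- do.call(cbind, DT.list)

# keep only the first occurrence of each column name, selecting by position
keep <- which(!duplicated(names(wide)))
wide <- wide[, keep, with = FALSE]

names(wide)
```

This creates the duplicate columns first and discards them afterwards, which is the extra work the question is trying to avoid.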

thanks

Upvotes: 5

Answers (3)

Frank

Reputation: 66819

Here's another way:

Reduce(
  function(x,y){
    newcols = setdiff(names(y),names(x))
    x[,(newcols)] <- y[, ..newcols]
    x
  }, 
  DT.list,
  init = copy(DT.list[[1]][,c("x","y")])
)
#    x y v1 v2 v3 v4 v5 v6
# 1: 1 a  1  3  5  7  9 11
# 2: 1 a  2  4  6  8 10 12

This avoids modifying the list (as @bgoldst's <- NULL assignment does) or making copies of every element of the list (as, I think, the lapply approach does). I would probably do the <- NULL thing in most practical applications, though.
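The same idea can also be written with data.table's `:=`, so that new columns are added by reference to a single copy. This is only a sketch, assuming x and y are identical across the list and the rows are already aligned; `res`, `newcols`, and `vals` are illustrative names, and it does use an explicit loop.

```r
library(data.table)

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.3 <- data.table(x = c(1, 1), y = c("a", "a"), v4 = c(7, 8), v5 = c(9, 10), v6 = c(11, 12))
DT.list <- list(DT.1, DT.2, DT.3)

# start from a copy so the original list elements are left untouched
res <- copy(DT.list[[1]])

for (dt in DT.list[-1]) {
  # columns present in dt but not yet in the result
  newcols <- setdiff(names(dt), names(res))
  if (length(newcols) > 0) {
    vals <- dt[, newcols, with = FALSE]
    res[, (newcols) := vals]  # add the new columns to res by reference
  }
}

names(res)
```

Only the first element is copied up front; the remaining elements are read but not modified.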

Upvotes: 2

bgoldst

Reputation: 35314

Here's how it could be done in one shot, using lapply() to remove columns x and y from second-and-subsequent data.tables before calling cbind():

do.call(cbind,c(DT.list[1],lapply(DT.list[-1],`[`,j=-c(1,2))));
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

Another approach is to remove columns x and y from second-and-subsequent data.tables before doing a straight cbind(). I think there's nothing wrong with using a for loop for this:

for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
DT.list;
## [[1]]
##    x y v1 v2
## 1: 1 a  1  3
## 2: 1 a  2  4
##
## [[2]]
##    v3
## 1:  5
## 2:  6
##
## [[3]]
##    v4 v5 v6
## 1:  7  9 11
## 2:  8 10 12
##
do.call(cbind,DT.list);
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

Upvotes: 1

Valentin_Ștefan

Reputation: 6436

Another option would be to use the `[` indexing function inside `lapply` on the list of data.tables and exclude the "unwanted" columns (in your case x and y). This way, duplicate columns are never created.

# your given test data
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)

A) using a character vector to indicate which columns to exclude

# cbind a list of subsetted data.tables
exclude.col <- c("x","y")
myDT <- do.call(cbind, lapply(DT.list, `[`,,!exclude.col, with = FALSE))
myDT
##    v1 v2 v3 v4 v5 v6
## 1:  1  3  5  7  9 11
## 2:  2  4  6  8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

B) same as above but using the character vector directly in `lapply`

myDT <- do.call(cbind, lapply(DT.list, `[`,,!c("x","y")))
myDT
##    v1 v2 v3 v4 v5 v6
## 1:  1  3  5  7  9 11
## 2:  2  4  6  8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

C) same as above, but all in one line

do.call( cbind, c(list(DT.list[[1]][,.(x,y)]), lapply(DT.list, `[`,,!c("x","y"))) )
# way too many brackets...but I think it works
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

Upvotes: 0
