greyBag
greyBag

Reputation: 397

cbind specific columns from multiple data.tables efficiently

I have a list of data.tables that I need to cbind, however, I only need the last X columns.

My data is structured as follows:

DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))

DT.list <- list(DT.1, DT.2, DT.3)
>DT.list
    [[1]]
   x y v1 v2
1: 1 a  1  3
2: 1 a  2  4

[[2]]
   x y v3
1: 1 a  5
2: 1 a  6

[[3]]
   x y v4 v5 v6
1: 1 a  7  9 11
2: 1 a  8 10 12

Columns x and y are the same for each of the data.tables but the amount of columns differs. The output should not include duplicate x, and y columns. It should look as follows:

   x y v1 v2 v3 v4 v5 v6
1: 1 a  1  3  5  7  9 11
2: 1 a  2  4  6  8 10 12

I want to avoid using a loop. I am able to bind the data.tables using do.call("cbind", DT.list) and then remove the duplicates manually, but is there a way where the duplicates aren't created in the first place? Also, efficiency is important since the lists can be long with large data.tables.

thanks

Upvotes: 5

Views: 5672

Answers (3)

Frank
Frank

Reputation: 66819

Here's another way:

Reduce(
  function(x,y){
    newcols = setdiff(names(y),names(x))
    x[,(newcols)] <- y[, ..newcols]
    x
  }, 
  DT.list,
  init = copy(DT.list[[1]][,c("x","y")])
)
#    x y v1 v2 v3 v4 v5 v6
# 1: 1 a  1  3  5  7  9 11
# 2: 1 a  2  4  6  8 10 12

This avoids modifying the list (as @bgoldst's <- NULL assignment does) or making copies of every element of the list (as, I think, the lapply approach does). I would probably do the <- NULL thing in most practical applications, though.

Upvotes: 2

bgoldst
bgoldst

Reputation: 35314

Here's how it could be done in one shot, using lapply() to remove columns x and y from second-and-subsequent data.tables before calling cbind():

do.call(cbind,c(DT.list[1],lapply(DT.list[2:length(DT.list)],`[`,j=-c(1,2))));
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

Another approach is to remove columns x and y from second-and-subsequent data.tables before doing a straight cbind(). I think there's nothing wrong with using a for loop for this:

for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
DT.list;
## [[1]]
##    x y v1 v2
## 1: 1 a  1  3
## 2: 1 a  2  4
##
## [[2]]
##    v3
## 1:  5
## 2:  6
##
## [[3]]
##    v4 v5 v6
## 1:  7  9 11
## 2:  8 10 12
##
do.call(cbind,DT.list);
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

Upvotes: 1

Valentin_Ștefan
Valentin_Ștefan

Reputation: 6436

Another option would be to use the [,, indexing function option inside lapplyon the list of data tables and exclude "unwanted" columns (in your case x and y). In this way, duplicates columns are not created.

# your given test data
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)

A) using a character vector to indicate which columns to exclude

# cbind a list of subsetted data.tables
exclude.col <- c("x","y")
myDT <- do.call(cbind, lapply(DT.list, `[`,,!exclude.col, with = FALSE))
myDT
##    v1 v2 v3 v4 v5 v6
## 1:  1  3  5  7  9 11
## 2:  2  4  6  8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

B) same as above but using the character vector directly in lapply

myDT <- do.call(cbind, lapply(DT.list, `[`,,!c("x","y")))
myDT
##    v1 v2 v3 v4 v5 v6
## 1:  1  3  5  7  9 11
## 2:  2  4  6  8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

C) same as above, but all in one line

do.call( cbind, c(list(DT.list[[1]][,.(x,y)]), lapply(DT.list, `[`,,!c("x","y"))) )
# way too many brackets...but I think it works
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

Upvotes: 0

Related Questions