srao
srao

Reputation: 308

Merge multiple data tables with duplicate column names

I am trying to merge (join) multiple data tables (obtained with fread from 5 csv files) to form a single data table. I get an error when I try to merge 5 data tables, but works fine when I merge only 4. MWE below:

# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)

# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.

# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))

    x y.x y.y y.x y.y 
 1: a  10  11  12  13 
 2: b  11  12  13  14 
 3: c  12  13  14  15 
 4: d  13  14  15  16 
 5: e  14  15  16  17 
 6: f  15  16  17  18

I have a workaround, where, if I change the 2nd column name for DT1:

setnames(DT1, "y", "new_y")

# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Why does this happen, and is there any way to merge an arbitrary number of data tables with the same column names without changing any of the column names?

Upvotes: 13

Views: 11009

Answers (7)

AdamKarol
AdamKarol

Reputation: 1

This is an alternative solution - you can define join columns each time (when your x columns do not have the same values) . You need to define vectors with column names. Then you may chain joining by reference like this:

cols_dt1 <- colnames(dt_dt1)[!colnames(dt_dt1) %in% 'join_column1']
cols_dt2 <- colnames(dt_dt2)[!colnames(dt_dt2) %in% ' join_column2']
cols_dt3 <- colnames(dt_dt3)[!colnames(dt_dt3) %in% ' join_column3']
cols_dt4 <- colnames(dt_dt4)[!colnames(dt_dt4) %in% ' join_column4']
cols_dt5 <- colnames(dt_dt5)[!colnames(dt_dt5) %in% ' join_column5']

data_dt[dt_dt1, on=.( join_column1), (cols_dt1) := mget(cols_dt1)][
  dt_dt2, on=.( join_column2), (cols_dt2) := mget(cols_dt2)][
    dt_dt3, on=.( join_column3), (cols_dt3) := mget(cols_dt3)][
      dt_dt4, on=.( join_column4), (cols_dt4) := mget(cols_dt4)][
        dt_dt5, on=.( join_column5), (cols_dt5) := mget(cols_dt5)]

Upvotes: 0

Henrik Seidel
Henrik Seidel

Reputation: 301

Another way of doing this:

dts <- list(DT1, DT2, DT3, DT4, DT5)

names(dts) <- paste("y", seq_along(dts), sep="")
data.table::dcast(rbindlist(dts, idcol="id"), x ~ id, value.var = "y")

#   x y1 y2 y3 y4 y5
#1: a 10 11 12 13 14
#2: b 11 12 13 14 15
#3: c 12 13 14 15 16
#4: d 13 14 15 16 17
#5: e 14 15 16 17 18
#6: f 15 16 17 18 19

The package name in "data.table::dcast" is added to ensure that the call returns a data table and not a data frame even if the "reshape2" package is loaded as well. Without mentioning the package name explicitly, the dcast function from the reshape2 package might be used which works on a data.frame and returns a data.frame instead of a data.table.

Upvotes: 2

Jaap
Jaap

Reputation: 83275

If it's just those 5 datatables (where x is the same for all datatables), you could also use nested joins:

# set the key for each datatable to 'x'
setkey(DT1,x)
setkey(DT2,x)
setkey(DT3,x)
setkey(DT4,x)
setkey(DT5,x)

# the nested join
mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]

Or as @Frank said in the comments:

DTlist <- list(DT1,DT2,DT3,DT4,DT5)
Reduce(function(X,Y) X[Y], DTlist)

which gives:

   x y1 y2 y3 y4 y5
1: a 10 11 12 13 14
2: b 11 12 13 14 15
3: c 12 13 14 15 16
4: d 13 14 15 16 17
5: e 14 15 16 17 18
6: f 15 16 17 18 19

This gives the same result as:

mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

> identical(mergedDT1,mergedDT2)
[1] TRUE

When your x columns do not have the same values, a nested join will not give the desired solution:

DT1[DT2[DT3[DT4[DT5[DT6]]]]]

this gives:

   x y1 y2 y3 y4 y5 y6
1: b 11 12 13 14 15 15
2: c 12 13 14 15 16 16
3: d 13 14 15 16 17 17
4: e 14 15 16 17 18 18
5: f 15 16 17 18 19 19
6: g NA NA NA NA NA 20

While:

Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))

gives:

   x y1 y2 y3 y4 y5 y6
1: a 10 11 12 13 14 NA
2: b 11 12 13 14 15 15
3: c 12 13 14 15 16 16
4: d 13 14 15 16 17 17
5: e 14 15 16 17 18 18
6: f 15 16 17 18 19 19
7: g NA NA NA NA NA 20

Used data:

In order to make the code with Reduce work, I changed the names of the y columns.

DT1 <- data.table(x = letters[1:6], y1 = 10:15)
DT2 <- data.table(x = letters[1:6], y2 = 11:16)
DT3 <- data.table(x = letters[1:6], y3 = 12:17)
DT4 <- data.table(x = letters[1:6], y4 = 13:18)
DT5 <- data.table(x = letters[1:6], y5 = 14:19)

DT6 <- data.table(x = letters[2:7], y6 = 15:20, key="x")

Upvotes: 9

Veerendra Gadekar
Veerendra Gadekar

Reputation: 4472

Alternatively you could setNames for the columns before and do merge like this

dts = list(DT1, DT2, DT3, DT4, DT5)
names(dts) = paste('DT', c(1:5), sep = '')    

dtlist = lapply(names(dts),function(i) 
         setNames(dts[[i]], c('x', paste('y',i,sep = '.'))))

Reduce(function(...) merge(..., all = T), dtlist)

#   x y.DT1 y.DT2 y.DT3 y.DT4 y.DT5
#1: a    10    11    12    13    14
#2: b    11    12    13    14    15
#3: c    12    13    14    15    16
#4: d    13    14    15    16    17
#5: e    14    15    16    17    18
#6: f    15    16    17    18    19

Upvotes: 1

Frank
Frank

Reputation: 66819

stack and reshape I don't think this maps exactly to the merge function but...

mycols <- "x"
DTlist <- list(DT1,DT2,DT3,DT4,DT5)

dcast(rbindlist(DTlist,idcol=TRUE), paste0(paste0(mycols,collapse="+"),"~.id"))

#    x  1  2  3  4  5
# 1: a 10 11 12 13 14
# 2: b 11 12 13 14 15
# 3: c 12 13 14 15 16
# 4: d 13 14 15 16 17
# 5: e 14 15 16 17 18
# 6: f 15 16 17 18 19

I have no sense for if this would extend to having more columns than y.

merge-assign

DT <- Reduce(function(...) merge(..., all = TRUE, by = mycols), 
  lapply(DTlist,`[.noquote`,mycols))

for (k in seq_along(DTlist)){
  js = setdiff( names(DTlist[[k]]), mycols )
  DT[DTlist[[k]], paste0(js,".",k) := mget(paste0("i.",js)), on=mycols, by=.EACHI]
}

#    x y.1 y.2 y.3 y.4 y.5
# 1: a  10  11  12  13  14
# 2: b  11  12  13  14  15
# 3: c  12  13  14  15  16
# 4: d  13  14  15  16  17
# 5: e  14  15  16  17  18
# 6: f  15  16  17  18  19

(I'm not sure if this fully extends to other cases. Hard to say because the OP's example really doesn't demand the full functionality of merge. In the OP's case, with mycols="x" and x being the same across all DT*, obviously a merge is inappropriate, as mentioned by @eddi. The general problem is interesting, though, so that's what I'm trying to attack here.)

Upvotes: 5

eddi
eddi

Reputation: 49448

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

Reduce((function() {counter = 0
                    function(x, y) {
                      counter <<- counter + 1
                      d = merge(x, y, all = T, by = 'x')
                      setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                    }})(), list(DT1, DT2, DT3, DT4, DT5))
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19

Upvotes: 7

bramtayl
bramtayl

Reputation: 4024

Using reshaping gives you a lot more flexibility in how you want to name your columns.

library(dplyr)
library(tidyr)

list(DT1, DT2, DT3, DT4, DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste("y", source, sep = ".")) %>%
  spread(source, y)

Or, this would work

library(dplyr)
library(tidyr)

list(DT1 = DT1, DT2 = DT2, DT3 = DT3, DT4 = DT4, DT5 = DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste(source, "y", sep = ".")) %>%
  spread(source, y)

Upvotes: 3

Related Questions