jay.sf
jay.sf

Reputation: 72613

How to merge a list of data.tables without getting splitted columns?

I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.

# base R
system.time(
  M1 <- Reduce(function(...) merge(..., all=TRUE), L)
  )
# user  system elapsed 
# 5.05    0.00    5.20 

# data.table    
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
system.time(
  M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
  )
# user  system elapsed 
# 0.12    0.00    0.12

Both approaches yield the same values, however there are some columns that are split with data.table.

base R:

set.seed(1)
car::some(M1, 5)
#        sid    id         V3        V4          a         b
# 60504    1 60504 -0.6964804 -1.210195         NA        NA
# 79653    1 79653 -2.5287163 -1.087546         NA        NA
# 111637   2 11637  0.7104236        NA -1.7377657        NA
# 171855   2 71855  0.2023342        NA -0.6334279        NA
# 272460   3 72460 -0.5098994        NA         NA 0.2738896

data.table:

set.seed(1)
car::some(M2, 5)
#    sid    id       V3.x        V4      V3.y          a         V3         b
# 1:   1 60504 -0.6964804 -1.210195        NA         NA         NA        NA
# 2:   1 79653 -2.5287163 -1.087546        NA         NA         NA        NA
# 3:   2 11637         NA        NA 0.7104236 -1.7377657         NA        NA
# 4:   2 71855         NA        NA 0.2023342 -0.6334279         NA        NA
# 5:   3 72460         NA        NA        NA         NA -0.5098994 0.2738896

Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)

Data

fun <- function(x){
  set.seed(x)
  data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
}
tmp <- lapply(1:3, fun)
df1 <- tmp[[1]]
df2 <- tmp[[2]]
df3 <- tmp[[3]]
rm(tmp)
names(df2)[4] <- c("a")
names(df3)[4] <- c("b")
L <- list(df1, df2, df3)

Related: 1, 2

Upvotes: 1

Views: 1164

Answers (1)

chinsoon12
chinsoon12

Reputation: 25225

The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.

The by argument in data.table::merge defaults to the shared key columns between the two tables (i.e. sid and id in this case). And since the tables have columns named V3, suffixes are appended to the new columns.

So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:

commcols <- Reduce(intersect, lapply(L, names))
L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

Upvotes: 6

Related Questions