Dr. Beeblebrox
Dr. Beeblebrox

Reputation: 848

data.table merge produces extra columns [R]

Below I define a master dataset of dimensions 12x5. I divide it into four data.tables and I want to merge them. There is no row ID overlap between data.tables and some column name overlap. When I merge them, merge() doesn't recognize column name matches, and creates new columns for every column in each data.table. The final merged data.table should be 12x5, but it is coming out as 12x7. I thought that the all=TRUE command in data.table's merge() would solve this.

library(data.table)

a <- data.table(id = c(1, 2, 3),  C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6),  C1 = c(1, 2, 3),  C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9),  C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12),  C3 = c(8, 2, 3), C4 = c(4, 6, 8))

setkey(a, "id")
setkey(b, "id")
setkey(c, "id")
setkey(d, "id")

final <- merge(a, b,  all = TRUE)
final <- merge(final, c,  all = TRUE)
final <- merge(final, d,  all = TRUE)

names(final)
dim(final)  #outputs correct numb of rows, but too many columns

Upvotes: 5

Views: 2444

Answers (1)

R for the Win
R for the Win

Reputation: 116

The problem is with the way you are using the 'merge' function. 'merge' function in data.table package by default merges two data tables by the "shared key columns between them". Suppose you create 'a' and 'b' data tables like this:

library(data.table)
a <- data.table(id = c(1, 2, 3),  C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6),  C1 = c(1, 2, 3),  C2 = c(2, 3, 4))
setkey(a, "id")
setkey(b, "id")

where 'a' is going to be like:

   id C1
1:  1  1
2:  2  2
3:  3  3

and 'b' is going to be like:

   id C1 C2
1:  4  1  2
2:  5  2  3
3:  6  3  4

now, lets first try your code:

merge(a, b,  all = TRUE)

This is the result:

   id C1.x C1.y C2
1:  1    1   NA NA
2:  2    2   NA NA
3:  3    3   NA NA
4:  4   NA    1  2
5:  5   NA    2  3
6:  6   NA    3  4

This is due to the fact that 'merge' function is taking only 'id' field (shared key between data tables 'a' and 'b') as the merging column, while adding all non-shared columns to the resulting data table. Now lets try specifying what columns to merge on:

merge(a, b, by=c("id","C1"), all = TRUE)

now the result is going to be:

   id C1 C2
1:  1  1 NA
2:  2  2 NA
3:  3  3 NA
4:  4  1  2
5:  5  2  3
6:  6  3  4

Same applies to other merge functions you called. So try this:

final <- merge(a, b, by=c("id","C1"), all = TRUE)
final <- merge(final, c, by="id", all = TRUE)  #here you don't necessarily need to specify by...
final <- merge( final, d, by=c("id","C3"),all=TRUE)

dim(final)
[1] 12  5

Upvotes: 6

Related Questions