Peter
Peter

Reputation: 11

Merge is multiplying data frames when it should left merge

I have two data frames "cb" and "ca" which share a column name "Item", with elements of the same type (factor). Yet, left-merge fails to produce a data frame of the same size as the x data frame. What am I missing?? Many thanks in advance.

ca <- data.frame(A = c(1,2,3,2,1), Item = c("1","1","2","2","2"))
cb <- data.frame(B = c(2,3,1,2), Item = c("1","1","2","2"))
t <- merge(cb,ca, by = "Item", all.x = TRUE, sort = FALSE)

Merge results in:

   Item B A
1     1 2 1
2     1 2 2
3     1 3 1
4     1 3 2
5     2 1 3
6     2 1 2
7     2 1 1
8     2 2 3
9     2 2 2
10    2 2 1

But what I need is:

> t
Item A B
1    1 2
1    2 3
2    3 1
2    2 2

Many thanks in advance!

Upvotes: 0

Views: 470

Answers (1)

r2evans
r2evans

Reputation: 160407

As @won782 suggesed, it is likely due to duplicates. Here is a small MWE demonstrating what duplicates (on both sides) will do to a merge.

First, we have no duplicates:

ca <- data.frame(A = 1:4  , Item = 1:4)
cb <- data.frame(B = 11:13, Item = 1:3)
merge(ca, cb, by = "Item", all.x = TRUE)
#   Item A  B
# 1    1 1 11
# 2    2 2 12
# 3    3 3 13
# 4    4 4 NA

If we introduce dupes into the "left" side, all looks good still:

ca <- data.frame(A = 1:4  , Item = c(1L,1:3))
cb <- data.frame(B = 11:13, Item = 1:3)
merge(ca, cb, by = "Item", all.x = TRUE)
#   Item A  B
# 1    1 1 11
# 2    1 2 11
# 3    2 3 12
# 4    3 4 13

When we have duplicates in the right side, though, each matching row on the left is duplicated appropriately. Namely, see "Item 1"

ca <- data.frame(A = 1:4  , Item = 1:4)
cb <- data.frame(B = 11:14, Item = c(1L,1:3))
merge(ca, cb, by = "Item", all.x = TRUE)
#   Item A  B
# 1    1 1 11
# 2    1 1 12
# 3    2 2 13
# 4    3 3 14
# 5    4 4 NA

Notice how 4 rows in ca produced 5 rows in the merge.

If we have dupes in both sides, it explodes a little. Now we have "Item 1, A 1" twice" (same for "A 2"):

ca <- data.frame(A = 1:4  , Item = c(1L,1:3))
cb <- data.frame(B = 11:14, Item = c(1L,1:3))
merge(ca, cb, by = "Item", all.x = TRUE)
#   Item A  B
# 1    1 1 11
# 2    1 1 12
# 3    1 2 11
# 4    1 2 12
# 5    2 3 13
# 6    3 4 14

There are several good tutorials on merging data. Don't limit yourself to searching for "R merge", you can find a lot with searches for "SQL join" as well (adding "tutorial" and/or "howto" to those searches might be helpful).

Bottom line, though, is that there is no "general answer" for this problem, as it is actually "by design". Anything other than this cartesian join would require either data reduction or use of another key (existing or additional) for clarifying how you want things to join.

Upvotes: 1

Related Questions