Reputation: 406

R merge() rbinds instead of merging

I ran across a behaviour of merge() in R that I can't understand. It seems that it either merges or rbinds data frames depending on whether a column has one or more unique values in it.

a1 <- data.frame (A = c (1, 1))
a2 <- data.frame (A = c (1, 2))

# > merge (a1, a1)
#   A
# 1 1
# 2 1
# 3 1
# 4 1

# > merge (a2, a2)
#   A
# 1 1
# 2 2

The latter is the result that I would expect, and want, in both cases. I also tried having more than one column, as well as characters instead of numerals, and the results are the same: multiple values result in merging, one unique value results in rbinding.

Upvotes: 2

Answers (2)

Milan Valášek

Reputation: 611

The behaviour is detailed in the documentation but, basically, merge() will, by default, want to give you a data.frame with columns taken from both original dfs. It is going to merge rows of the two by unique values of all common columns.

df1 <- data.frame(a = 1:3, b = letters[1:3])
df2 <- data.frame(a = 1:5, c = LETTERS[1:5])
df1
  a b
1 1 a
2 2 b
3 3 c
df2
  a c
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
merge(df1, df2)
  a b c
1 1 a A
2 2 b B
3 3 c C

What's happening in your first example is that merge() wants to combine the rows of your two data frames by the A column but because both rows in both dfs are the same, it can't figure out which row to merge with which so it creates all possible combinations.

In your second example, you don't have this problem and so merging is unambiguous. The 1 rows will get merged together as will the 2 rows.

The scenarios are more apparent when you have multiple columns in your dfs:

Case 1:

> df1 <- data.frame(a = c(1, 1), b = letters[1:2])
> df2 <- data.frame(a = c(1, 1), c = LETTERS[1:2])
> df1
  a b
1 1 a
2 1 b
> df2
  a c
1 1 A
2 1 B
> merge(df1, df2)
  a b c
1 1 a A
2 1 a B
3 1 b A
4 1 b B

Case 2:

> df1 <- data.frame(a = c(1, 2), b = letters[1:2])
> df2 <- data.frame(a = c(1, 2), c = LETTERS[1:2])
> df1
  a b
1 1 a
2 2 b
> df2
  a c
1 1 A
2 2 B
> merge(df1, df2)
  a b c
1 1 a A
2 2 b B

Upvotes: 1

G. Grothendieck

Reputation: 269491

In the first case each row matches two rows so there are 2x2=4 rows in the output and in the second case each row matches one row so there are 2 rows in the output.

To match on row number use this:

merge(a1, a1, by = 0)
##   Row.names A.x A.y
## 1         1   1   1
## 2         2   1   1

or match on row number and only return the left instance:

library(sqldf)

sqldf("select x.* from a1 x left join a1 y on x.rowid = y.rowid")
##   A
## 1 1
## 2 1

or match on row number and return both instances:

sqldf("select x.A A1, y.A A2 from a1 x left join a1 y on x.rowid = y.rowid")
##   A1 A2
## 1  1  1
## 2  1  1

Upvotes: 3

R merge() rbinds instead of merging

Answers (2)

Related Questions