Is there a way to merge (left outer join) data frames by multiple columns, but with OR condition?
Example: There are two data frames, df1 and df2, each with columns x, y and num. I would like a data frame with all rows from df1, but with only those rows from df2 that satisfy the condition df1$x == df2$x OR df1$y == df2$y.
Here are sample data:
df1 <- data.frame(x = LETTERS[1:5],
                  y = 1:5,
                  num = rnorm(5), stringsAsFactors = FALSE)
df1
x y num
1 A 1 0.4209480
2 B 2 0.4687401
3 C 3 0.3018787
4 D 4 0.0669793
5 E 5 0.9231559
df2 <- data.frame(x = LETTERS[3:7],
                  y = 3:7,
                  num = rnorm(5), stringsAsFactors = FALSE)
df2$x[2] <- NA
df2$y[1] <- NA
df2
x y num
1 C NA -0.7160824
2 <NA> 4 -0.3283618
3 E 5 -1.8775298
4 F 6 -0.9821082
5 G 7 1.8726288
Then, the result is expected to be:
x y num x y num
1 A 1 0.4209480 <NA> NA NA
2 B 2 0.4687401 <NA> NA NA
3 C 3 0.3018787 C NA -0.7160824
4 D 4 0.0669793 <NA> 4 -0.3283618
5 E 5 0.9231559 E 5 -1.8775298
The most obvious solution is to use the sqldf package:
mergedData <- sqldf::sqldf("SELECT * FROM df1
LEFT OUTER JOIN df2
ON df1.x = df2.x
OR df1.y = df2.y")
Unfortunately this simple solution is extremely slow, and it will take ages to merge data frames with more than 100k rows each.
Another option is to split the right data frame and merge by parts, but is there a more elegant or even "out of the box" solution?
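For reference, the "merge by parts" idea can be sketched in base R with match(), one column at a time. This is only a minimal sketch, assuming each row of df1 matches at most one row of df2. match_no_na is a hypothetical helper name, the toy data mirror the x and y columns of the printed example, and num is left out for brevity:

```r
# Toy data mirroring the x and y columns of the printed example
df1 <- data.frame(x = LETTERS[1:5], y = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(x = c("C", NA, "E", "F", "G"),
                  y = c(NA, 4L, 5L, 6L, 7L),
                  stringsAsFactors = FALSE)

# Hypothetical helper: position of each element of a in b,
# never letting an NA key count as a match
match_no_na <- function(a, b) {
  ix <- match(a, b)
  ix[is.na(a)] <- NA_integer_
  ix
}

ix_x <- match_no_na(df1$x, df2$x)  # part 1: match on x
ix_y <- match_no_na(df1$y, df2$y)  # part 2: match on y

# OR condition: prefer the x match, fall back to the y match
ix <- ix_x
ix[is.na(ix)] <- ix_y[is.na(ix)]

right <- df2[ix, ]                 # NA index gives an all-NA row
names(right) <- paste0(names(df2), ".2")
merged <- cbind(df1, right)
rownames(merged) <- NULL
merged
```

Because match() returns only the first hit, this does not cover the case where several rows of df2 match the same row of df1.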
Here's one approach using data.table. For each column, we perform a join but extract only the matching indices (as opposed to materialising the entire join). We then combine the indices from all the columns (this part would need some changes if there can be multiple matches).
library(data.table)
setDT(df1)  # convert to data.table by reference
setDT(df2)

foo <- function(dx, dy, cols) {
  ix <- lapply(cols, function(col) {
    # for each row in dx, get the matching row index of dy,
    # matching on the column named in "col" (NA if there is no match)
    dy[dx, on = col, which = TRUE]
  })
  # elementwise max across columns, ignoring NAs (NA only if no column matched)
  do.call(function(...) pmax(..., na.rm = TRUE), ix)
}
ix <- foo(df1, df2, c("x", "y"))      # matching indices of df2 for each row in df1
df1[, paste0("col", 1:3) := df2[ix]]  # update df1 by reference
df1
df1
# x y num col1 col2 col3
# 1: A 1 2.09611034 NA NA NA
# 2: B 2 -1.06795571 NA NA NA
# 3: C 3 1.38254433 C 3 1.0173476
# 4: D 4 -0.09367922 D 4 -0.6379496
# 5: E 5 0.47552072 E NA -0.1962038
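To make the index-combining step concrete, here is a small base R illustration of what pmax(..., na.rm = TRUE) does with two index vectors of the kind the per-column joins return. The vectors ix_x and ix_y are made-up illustrative values, not output copied from the session above:

```r
# Illustrative per-column match indices (NA means "no match on that column")
ix_x <- c(NA, NA, 1L, 2L, 3L)   # matches found via column x
ix_y <- c(NA, NA, 1L, 2L, NA)   # matches found via column y

# Elementwise maximum, ignoring NAs: a row stays NA only if
# neither column produced a match
ix <- pmax(ix_x, ix_y, na.rm = TRUE)
ix

# Caveat: if the two columns point at different rows, the larger
# index wins silently
pmax(c(1L, 4L), c(3L, 2L), na.rm = TRUE)
```

The second call shows why the approach assumes that both columns, when they match at all, agree on the row: a disagreement is resolved by row position rather than reported.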
You can use setDF(df1) to convert it back to a data.frame, if necessary.
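If there can be multiple matches, one possible change is to collect (row of df1, row of df2) index pairs per column, take their union, and only then pull the rows. The following is a rough sketch of that idea, written in base R so it runs without extra packages and not tuned for speed. or_left_join is a hypothetical name and the toy data are made up for illustration:

```r
# Hypothetical helper: left outer join of d1 and d2 where a pair of rows
# matches if ANY of the columns in `cols` is equal (NA keys never match)
or_left_join <- function(d1, d2, cols) {
  d1 <- as.data.frame(d1)
  d2 <- as.data.frame(d2)
  pairs <- lapply(cols, function(col) {
    a <- data.frame(k = d1[[col]], i1 = seq_len(nrow(d1)))
    b <- data.frame(k = d2[[col]], i2 = seq_len(nrow(d2)))
    # inner join on this one column, keeping only the row indices
    merge(a[!is.na(a$k), ], b[!is.na(b$k), ], by = "k")[, c("i1", "i2")]
  })
  idx <- unique(do.call(rbind, pairs))   # union of matches over all columns
  unmatched <- setdiff(seq_len(nrow(d1)), idx$i1)
  if (length(unmatched) > 0) {
    # keep unmatched rows of d1, as a left outer join requires
    idx <- rbind(idx, data.frame(i1 = unmatched, i2 = NA_integer_))
  }
  idx <- idx[order(idx$i1, idx$i2), ]
  right <- d2[idx$i2, , drop = FALSE]    # NA index gives an all-NA row
  names(right) <- paste0(names(d2), ".2")
  out <- cbind(d1[idx$i1, , drop = FALSE], right)
  rownames(out) <- NULL
  out
}

# Toy data: row "C" of d1 matches two rows of d2, row "D" matches none
d1 <- data.frame(x = c("A", "B", "C", "D"), y = 1:4, stringsAsFactors = FALSE)
d2 <- data.frame(x = c("C", "C", "Z", NA), y = c(9L, 3L, 1L, 2L),
                 stringsAsFactors = FALSE)
res <- or_left_join(d1, d2, c("x", "y"))
res
```

A pair that matches on both x and y appears only once because of the unique() call. The same index-pair idea could presumably be ported to data.table joins for larger inputs.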