Biblot

Reputation: 705

Left join two R data frames with OR conditions

Problem

I have two data frames that I want to join using a conditional statement on three non-numeric variables. Here is a pseudo-code version of what I want to achieve.

Join DF1 and DF2 on DF1$A == DF2$A | DF1$A == DF2$B

Dataset

Here's some code to create the two data frames. variant_index is the data frame that will be used to annotate input using a left_join:

library(dplyr)
options(stringsAsFactors = FALSE)

set.seed(5)
variant_index <- data.frame(
  rsid   = rep(sapply(1:5, function(x) paste0(c("rs", sample(0:9, 8, replace = TRUE)), collapse = "")), each = 2),
  chrom  = rep(sample(1:22, 5), each = 2),
  ref    = rep(sample(c("A", "T", "C", "G"), 5, replace = TRUE), each = 2),
  alt    = sample(c("A", "T", "C", "G"), 10, replace = TRUE),
  eaf    = runif(10),
  stringsAsFactors = FALSE
)
variant_index[1, "alt"] <- "T"
variant_index[8, "alt"] <- "A"

input <- variant_index[seq(1, 10, 2), ] %>%
  select(rsid, chrom)
input$assessed <- c("G", "C", "T", "A", "T")

What I tried

I would like to perform a left_join on input to annotate it with the eaf column from variant_index. As you can see from the input data frame, its assessed column can match either variant_index$ref or variant_index$alt. The rsid and chrom columns will always match.

I know I can specify multiple columns in the by argument of left_join, but if I understand correctly, the conditions are always combined with AND, so the condition would be

input$assessed == variant_index$ref & input$assessed == variant_index$alt

whereas I want to achieve

input$assessed == variant_index$ref | input$assessed == variant_index$alt

Possible solution

The desired output can be obtained like so:

input %>% 
  left_join(variant_index, by = c("rsid", "chrom")) %>% 
  filter(assessed == ref | assessed == alt)

But it doesn't seem like the best solution to me: the join on rsid and chrom alone can produce up to twice as many rows as input before the filter is applied, and I would like to apply this join to data frames containing 100M+ rows. Is there a better solution?
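For reference, one dplyr-only way to express the OR condition as an ordinary equi-join is to stack the index so that ref and alt share a single key column. The sketch below uses small hypothetical toy data (not the data frames above) and an assumed helper name, long_index. It doubles the index rather than the joined result, and whether that is actually cheaper at 100M+ rows is untested.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
variant_index <- data.frame(
  rsid  = c("rs1", "rs1", "rs2"),
  chrom = c(1, 1, 2),
  ref   = c("A", "A", "C"),
  alt   = c("T", "G", "A"),
  eaf   = c(0.1, 0.2, 0.3),
  stringsAsFactors = FALSE
)
input <- data.frame(
  rsid     = c("rs1", "rs2", "rs3"),
  chrom    = c(1, 2, 3),
  assessed = c("G", "C", "T"),
  stringsAsFactors = FALSE
)

# One copy of the index keyed on ref, one keyed on alt;
# distinct() drops the duplicate row that appears when ref == alt
long_index <- bind_rows(
  variant_index %>% mutate(assessed = ref),
  variant_index %>% mutate(assessed = alt)
) %>%
  distinct()

# Plain equi-join: assessed now matches either allele
annotated <- input %>%
  left_join(long_index, by = c("rsid", "chrom", "assessed"))
```

Unlike the join-then-filter version, this keeps input rows that have no match (rs3 here) with NA in the annotation columns, which is the usual left-join behaviour.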

Upvotes: 4

Views: 952

Answers (2)

Ransingh Satyajit Ray

Reputation: 404

Try this

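library(dplyr)
library(dbplyr)
x1 <- memdb_frame(x = 1:5)
x2 <- memdb_frame(x1 = 1:3, x2 = letters[1:3])
# sql_on takes a raw SQL predicate; LHS and RHS refer to the two tables
joined <- x1 %>% left_join(x2, sql_on = "LHS.x = RHS.x1 OR LHS.x = RHS.x2")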

Call show_query() on the result to see the generated SQL.

Upvotes: 1

G. Grothendieck

Reputation: 269311

Complex joins are straightforward in SQL:

library(sqldf)

sqldf("select *
  from variant_index v
  join input i on i.assessed = v.ref or i.assessed = v.alt")
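The query above is an inner join on the allele condition only. A variant closer to what the question asks for, sketched here on small hypothetical toy data (not the question's data frames), uses a left join and also matches on rsid and chrom. The selected column list is an illustrative choice.

```r
library(sqldf)

# Hypothetical toy data, for illustration only
variant_index <- data.frame(
  rsid  = c("rs1", "rs1", "rs2"),
  chrom = c(1, 1, 2),
  ref   = c("A", "A", "C"),
  alt   = c("T", "G", "A"),
  eaf   = c(0.1, 0.2, 0.3),
  stringsAsFactors = FALSE
)
input <- data.frame(
  rsid     = c("rs1", "rs2", "rs3"),
  chrom    = c(1, 2, 3),
  assessed = c("G", "C", "T"),
  stringsAsFactors = FALSE
)

# Left join: every input row is kept; the OR sits inside the ON clause
annotated <- sqldf("
  select i.rsid, i.chrom, i.assessed, v.ref, v.alt, v.eaf
  from input i
  left join variant_index v
    on  i.rsid  = v.rsid
    and i.chrom = v.chrom
    and (i.assessed = v.ref or i.assessed = v.alt)
")
```

Input rows without a matching variant (rs3 here) come back with NA in ref, alt and eaf.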

Upvotes: 3
