mehdi heidari
mehdi heidari

Reputation: 65

How can I find common rows between two dataframes based on two different numeric columns?

I have two different data frames that I want to find common rows between them according to two different numeric columns (these columns are the location of the genes in genome, start and end), but I don't want to be completely match based on those two columns and I want to find commons with (-+) ranges.

Actually I want to find common rows even they are with for example 200 bp or 500 bp differences in start and end locations. I am beginner in R and I could not find a way to do this.

In df2 although DEC1 has different START location but its difference is 400 bp with DEC1 in df1, so i want to consider this gene as a common between df1 and 2 but PSA and AKT have difference in START more than 500bp and I do not need them, although they have same END position between df1 and 2.

df1 <- data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c("9494957", "39689186", "89435677", "78484829"), END = c("52521320", "114050940", "100952138", "78486308"), STRAND = c("+", "+", "+", "-"))
df2 <- data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c("9494557", "37689186", "89435677", "79484829"), END = c("52521320", "114050940", "100952138", "78486308"), STRAND = c("+", "+", "+", "-"))

Upvotes: 0

Views: 75

Answers (1)

YH Jang
YH Jang

Reputation: 1348

To my knowledge, merge or join would be the only way to compare two columns.

Using data.table,

require(data.table)
#> Loading required package: data.table
df1 <- setDT(data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c(9494957, 39689186, 89435677, 78484829), END = c(52521320, 114050940, 100952138, 78486308), STRAND = c("+", "+", "+", "-")))
df2 <- setDT(data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c(9494557, 37689186, 89435677, 79484829), END = c(52521320, 114050940, 100952138, 78486308), STRAND = c("+", "+", "+", "-")))
df3 <- df1[df2[,.(name,START2=START, END2 = END)], on='name']

df3[abs(START2-START) %between% c(0,500) |
      abs(START2-START) %between% c(0,500)] 
#>    name    START       END STRAND   START2      END2
#> 1: DEC1  9494957  52521320      +  9494557  52521320
#> 2: DEC2 89435677 100952138      + 89435677 100952138

Or using dplyr,

df3 <- inner_join(df1, df2, suffix=c('1','2'),by='name')
df3 %>% filter(abs(START2-START1)<500)

Created on 2022-04-30 by the reprex package (v2.0.1)

Upvotes: 1

Related Questions