Reputation: 65
I have two different data frames that I want to find common rows between them according to two different numeric columns (these columns are the location of the genes in genome, start and end), but I don't want to be completely match based on those two columns and I want to find commons with (-+) ranges.
Actually I want to find common rows even they are with for example 200 bp or 500 bp differences in start and end locations. I am beginner in R and I could not find a way to do this.
In df2 although DEC1 has different START location but its difference is 400 bp with DEC1 in df1, so i want to consider this gene as a common between df1 and 2 but PSA and AKT have difference in START more than 500bp and I do not need them, although they have same END position between df1 and 2.
df1 <- data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c("9494957", "39689186", "89435677", "78484829"), END = c("52521320", "114050940", "100952138", "78486308"), STRAND = c("+", "+", "+", "-"))
df2 <- data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c("9494557", "37689186", "89435677", "79484829"), END = c("52521320", "114050940", "100952138", "78486308"), STRAND = c("+", "+", "+", "-"))
Upvotes: 0
Views: 75
Reputation: 1348
To my knowledge, merge
or join
would be the only way to compare two columns.
Using data.table
,
require(data.table)
#> Loading required package: data.table
df1 <- setDT(data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c(9494957, 39689186, 89435677, 78484829), END = c(52521320, 114050940, 100952138, 78486308), STRAND = c("+", "+", "+", "-")))
df2 <- setDT(data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c(9494557, 37689186, 89435677, 79484829), END = c(52521320, 114050940, 100952138, 78486308), STRAND = c("+", "+", "+", "-")))
df3 <- df1[df2[,.(name,START2=START, END2 = END)], on='name']
df3[abs(START2-START) %between% c(0,500) |
abs(START2-START) %between% c(0,500)]
#> name START END STRAND START2 END2
#> 1: DEC1 9494957 52521320 + 9494557 52521320
#> 2: DEC2 89435677 100952138 + 89435677 100952138
Or using dplyr
,
df3 <- inner_join(df1, df2, suffix=c('1','2'),by='name')
df3 %>% filter(abs(START2-START1)<500)
Created on 2022-04-30 by the reprex package (v2.0.1)
Upvotes: 1