Reputation: 43
I have many large dataframes. I want to get the first column is same and second column's difference is less than 5000 between two dataframes. such as:
>a
chr pos
chr2 10000
chr2 20000
chr2 45000
chr2 60000
chr2 80000
chr2 100000
>b
chr pos
chr2 10000
chr2 30000
chr2 40000
chr2 55000
chr2 80000
my expected result:
>c
chr pos
chr2 10000
chr2 45000
chr2 60000
chr2 80000
I tried by this:
c<-data.frame(chr=0, pos=0)
for (i in 1:nrow(b)) {
c1<-a[(a$chr %in% b[i, 1]) & abs(a$pos-b[i, 2])<=5000, ]
c<-rbind(c, c1)
}
c<-c[-1, ]
But it's too slow and bad effiency. I hope to get a better way. Thanks advance!
Upvotes: 1
Views: 36
Reputation: 28705
library(data.table)
Create a
and b
a <- fread("
chr pos
chr2 10000
chr2 20000
chr2 45000
chr2 60000
chr2 80000
chr2 100000
", data.table = F)
b <- fread("
chr pos
chr2 10000
chr2 30000
chr2 40000
chr2 55000
chr2 80000
", data.table = F)
setDT(a)
setDT(b)
@Jaap method, no extra columns needed
a[pos %inrange% b[, .(low = pos - 5000, high = pos + 5000)] & match(chr, b$chr)]
# chr pos
# 1: chr2 10000
# 2: chr2 45000
# 3: chr2 60000
# 4: chr2 80000
Join method, requiring extra columns to be added (same output)
Add bounds for considering pos
"similar"
a[, `:=`(low = pos - 5000, high = pos + 5000)]
Join b
with a
according to these bounds.
b[a, .(chr, pos), on = .(pos >= low, pos <= high, chr = chr), nomatch = 0]
SQL method, no extra columns needed (same output)
library(sqldf)
sqldf("
select a.*
from a
inner join b
on a.chr = b.chr
and b.pos between a.pos - 5000 and a.pos + 5000
")
Upvotes: 2