J.Wan
J.Wan

Reputation: 43

How to get the rows that one column is same and another column is similar between dataframe?

I have many large dataframes. I want to get the first column is same and second column's difference is less than 5000 between two dataframes. such as:

>a
chr  pos
chr2 10000
chr2 20000
chr2 45000
chr2 60000
chr2 80000
chr2 100000

>b
chr  pos
chr2 10000
chr2 30000
chr2 40000
chr2 55000
chr2 80000

my expected result:

>c
chr  pos
chr2 10000
chr2 45000
chr2 60000
chr2 80000

I tried by this:

c<-data.frame(chr=0, pos=0)
for (i in 1:nrow(b)) {
    c1<-a[(a$chr %in% b[i, 1]) & abs(a$pos-b[i, 2])<=5000, ]
    c<-rbind(c, c1)
  }
c<-c[-1, ]

But it's too slow and bad effiency. I hope to get a better way. Thanks advance!

Upvotes: 1

Views: 36

Answers (1)

IceCreamToucan
IceCreamToucan

Reputation: 28705

library(data.table)

Create a and b

a <- fread("
chr  pos
chr2 10000
chr2 20000
chr2 45000
chr2 60000
chr2 80000
chr2 100000
", data.table = F)

b <- fread("
chr  pos
chr2 10000
chr2 30000
chr2 40000
chr2 55000
chr2 80000
", data.table = F)

setDT(a)
setDT(b)

@Jaap method, no extra columns needed

a[pos %inrange% b[, .(low = pos - 5000, high = pos + 5000)] & match(chr, b$chr)]
#     chr   pos
# 1: chr2 10000
# 2: chr2 45000
# 3: chr2 60000
# 4: chr2 80000

Join method, requiring extra columns to be added (same output)

Add bounds for considering pos "similar"

a[, `:=`(low = pos - 5000, high = pos + 5000)]

Join b with a according to these bounds.

b[a, .(chr, pos), on = .(pos >= low, pos <= high, chr = chr), nomatch = 0]

SQL method, no extra columns needed (same output)

library(sqldf)

sqldf("
select  a.*
from    a
        inner join b 
          on  a.chr = b.chr
              and b.pos between a.pos - 5000 and a.pos + 5000
")

Upvotes: 2

Related Questions