user3224522

Reputation: 1151

How to find and append gene name based on overlapping chr start and stop positions?

I have df1:

1 25 85
1 2000 3000
2 345 2300

And df2:

1 34 45 geneX
1 100 1000 geneD
2 456 1500 geneH

The desired output:

1 25 85 geneX
1 2000 3000 NA
2 345 2300 geneH

I have tried two approaches:

library(data.table)
setDT(df1, key = names(df1))
setDT(df2, key = key(df1))
overlaps <- foverlaps(df1, df2, type = "any", nomatch = 0L)[, -c("chromosome","start", "stop")]

The code above returns some regions multiple times.

rangesC <- NULL
sb <- NULL
sb$gene <- NULL
for(i in levels(df1$chromosome)){
  sb <- subset(df1, df1$chromosome == i)
  s <- subset(df2, df2$chromosome == i)
  for(j in 1:nrow(sb)){
    sb$gene[j] <- as.character(s$gene[which(s$start < sb$start[j] &  s$stop > sb$stop[j])])
  }
  rangesC <- rbind(rangesC, sb)
}

But this code does not work. I would prefer to keep using the approaches I tried above.

Upvotes: 1

Views: 93

Answers (3)

akrun

Reputation: 887098

A faster option would be a non-equi join from data.table:

library(data.table)
setDT(df1)[df2, on = .(V1, V2 <= V2, V3 >= V3)]
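The join above returns one row per row of df2. If one row per df1 region is wanted instead, the join can be turned around. This is a minimal sketch, assuming the columns are named V1, V2, V3 (plus V4 for the gene name in df2) and that "overlap" means the gene lies entirely inside the region; the sample tables are rebuilt from the question.

```r
library(data.table)

# Sample data from the question; the column names V1..V4 are an assumption
df1 <- data.table(V1 = c(1L, 1L, 2L),
                  V2 = c(25L, 2000L, 345L),
                  V3 = c(85L, 3000L, 2300L))
df2 <- data.table(V1 = c(1L, 1L, 2L),
                  V2 = c(34L, 100L, 456L),
                  V3 = c(45L, 1000L, 1500L),
                  V4 = c("geneX", "geneD", "geneH"))

# For each df1 row, look up df2 rows on the same chromosome whose
# start is >= the region start and whose stop is <= the region stop.
# In `on`, the left-hand name is a df2 column, the right-hand one a df1 column.
res <- df2[df1, on = .(V1, V2 >= V2, V3 <= V3)]

# V2 and V3 in the result carry the df1 coordinates;
# V4 is geneX, NA, geneH for the three df1 rows
res
```

A region that contains several genes would still appear once per matching gene, so the result may need collapsing afterwards if exactly one row per region is required.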

Upvotes: 0

elielink

Reputation: 1202

Here is my solution, which uses the dplyr library.

Here is the data you showed us:

a = matrix(c(25 ,85,
 2000 ,3000,
 345 ,2300),byrow = T,ncol = 2)

b =  matrix(c(34 ,45 ,'geneX',
 100, 1000 ,'geneD',
 456, 1500 ,'geneH'),ncol=3,byrow=T)

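and here is my solution, using a for loop that compares row i of a with row i of b:

library(dplyr)  # provides between()

res = matrix(ncol=3)[-1,]
if(nrow(a)==nrow(b)){
  for(i in 1:nrow(a)){
    # keep the gene name only if both gene coordinates fall inside the region
    if(between(as.numeric(b[i,1]),a[i,1],a[i,2]) & between(as.numeric(b[i,2]),a[i,1],a[i,2])){
      res = rbind(res, c(a[i,],b[i,3]))
    }else{
      res = rbind(res, c(a[i,],NA))
    }
  }
}
res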


#> res
#     [,1]   [,2]   [,3]   
#[1,] "25"   "85"   "geneX"
#[2,] "2000" "3000" NA     
#[3,] "345"  "2300" "geneH"

I think you can drop all the as.numeric() calls in your case, because your data seem to be numeric already.
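The loop above pairs row i of a with row i of b, so it depends on both tables having the same number of rows in matching order, and it does not look at the chromosome. A rough sketch of a loop without those assumptions could look like the following; the data frames and the column names V1..V4 are assumed here, and several genes inside one region are simply pasted together with ";".

```r
# Sample data from the question as data frames (column names are assumed)
df1 <- data.frame(V1 = c(1L, 1L, 2L),
                  V2 = c(25L, 2000L, 345L),
                  V3 = c(85L, 3000L, 2300L))
df2 <- data.frame(V1 = c(1L, 1L, 2L),
                  V2 = c(34L, 100L, 456L),
                  V3 = c(45L, 1000L, 1500L),
                  V4 = c("geneX", "geneD", "geneH"),
                  stringsAsFactors = FALSE)

df1$gene <- NA_character_
for (i in seq_len(nrow(df1))) {
  # genes on the same chromosome that lie entirely inside region i
  hit <- which(df2$V1 == df1$V1[i] &
               df2$V2 >= df1$V2[i] &
               df2$V3 <= df1$V3[i])
  if (length(hit) > 0) {
    df1$gene[i] <- paste(df2$V4[hit], collapse = ";")
  }
}
df1
```

The columns are already numeric here, so no as.numeric() conversion is needed.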

Upvotes: 2

Ronak Shah

Reputation: 388982

You can use the fuzzyjoin package:

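library(dplyr)  # provides %>%

# match on equal chromosome, df1 start <= df2 start, and df1 stop >= df2 stop
fuzzyjoin::fuzzy_left_join(df1, df2, 
                           by = c('V1', 'V2', 'V3'), 
                           match_fun = c(`==`, `<=`, `>=`)) %>%
  dplyr::select(V1 = V1.x, V2 = V2.x, V3 = V3.x, V4)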

#  V1   V2   V3    V4
#1  1   25   85 geneX
#2  1 2000 3000  <NA>
#3  2  345 2300 geneH

Upvotes: 0
