Reputation: 1151
I have df1:
1 25 85
1 2000 3000
2 345 2300
And df2:
1 34 45 geneX
1 100 1000 geneD
2 456 1500 geneH
The desired output:
1 25 85 geneX
1 2000 3000 NA
2 345 2300 geneH
I have tried two approaches:
library(data.table)
setDT(df1, key = names(df1))
setDT(df2, key = key(df1))
overlaps <- foverlaps(df1, df2, type = "any", nomatch = 0L)[, -c("chromosome","start", "stop")]
The code above returns some regions multiple times...
rangesC <- NULL
sb <- NULL
sb$gene <- NULL
for(i in levels(df1$chromosome)){
sb <- subset(df1, df1$chromosome == i)
s <- subset(df2, df2$chromosome == i)
for(j in 1:nrow(sb)){
sb$gene[j] <- as.character(s$gene[which(s$start < sb$start[j] & s$stop > sb$stop[j])])
}
rangesC <- rbind(rangesC, sb)
}
But this code does not work. Preferably, I would like to keep using the approaches I tried above.
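For the foverlaps attempt, one possible adjustment is sketched below. It assumes the column names used in the question's code (chromosome, start, stop, gene) and rebuilds the sample tables inline. The idea is to ask foverlaps for at most one match per df1 region (mult = "first") and to keep unmatched regions (nomatch = NA), which may avoid the repeated regions. Whether "first" is the right gene to keep when several overlap is an assumption that depends on the real data.

```r
library(data.table)

# Sample data from the question; column names are assumed from the question's code.
df1 <- data.table(chromosome = c(1, 1, 2),
                  start = c(25, 2000, 345),
                  stop = c(85, 3000, 2300))
df2 <- data.table(chromosome = c(1, 1, 2),
                  start = c(34, 100, 456),
                  stop = c(45, 1000, 1500),
                  gene = c("geneX", "geneD", "geneH"))

# foverlaps requires the second table to be keyed; the last two key
# columns are treated as the interval.
setkey(df2, chromosome, start, stop)

# mult = "first" keeps one match per df1 region,
# nomatch = NA keeps df1 regions that overlap nothing.
ov <- foverlaps(df1, df2,
                by.x = c("chromosome", "start", "stop"),
                type = "any", mult = "first", nomatch = NA)

# In the result, start/stop come from df2 and i.start/i.stop from df1.
res <- ov[, .(chromosome, start = i.start, stop = i.stop, gene)]
res
```

Dropping mult = "first" would again return one row per overlapping pair, which is useful if all overlapping genes are needed.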
Upvotes: 1
Views: 93
Reputation: 887098
A faster option would be a non-equi join from data.table:
library(data.table)
setDT(df1)[df2, on = .(V1, V2 <= V2, V3 >= V3)]
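Note that with df2 in the i position, the join returns one row per row of df2. If the goal is to keep every row of df1, with NA where no gene matches, a sketch with the two tables swapped might look like this. The column names V1 to V4 are assumed (V1 = chromosome, V2 = start, V3 = stop, V4 = gene), and the sample data is rebuilt inline.

```r
library(data.table)

# Sample data from the question, with assumed default column names.
df1 <- data.table(V1 = c(1, 1, 2),
                  V2 = c(25, 2000, 345),
                  V3 = c(85, 3000, 2300))
df2 <- data.table(V1 = c(1, 1, 2),
                  V2 = c(34, 100, 456),
                  V3 = c(45, 1000, 1500),
                  V4 = c("geneX", "geneD", "geneH"))

# df2[df1, ...] keeps every row of df1. In each condition the left-hand
# name is a df2 column and the right-hand name is a df1 column, so this
# matches df2 ranges that lie inside the df1 region on the same V1.
res <- df2[df1, on = .(V1, V2 >= V2, V3 <= V3)]
res
```

If several df2 ranges fall inside the same df1 region, that region appears once per match.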
Upvotes: 0
Reputation: 1202
Here is my solution, which uses the dplyr library.
Here is the data you showed us:
a = matrix(c(25, 85,
             2000, 3000,
             345, 2300), byrow = TRUE, ncol = 2)
b = matrix(c(34, 45, 'geneX',
             100, 1000, 'geneD',
             456, 1500, 'geneH'), ncol = 3, byrow = TRUE)
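And here is my solution, which uses a for loop. It compares row i of a with row i of b, so it assumes both tables have the same number of rows in matching order: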
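library(dplyr)  # for between()
# Start with an empty 0 x 3 matrix and append one row per region
res = matrix(ncol = 3)[-1, ]
if (nrow(a) == nrow(b)) {
  for (i in 1:nrow(a)) {
    # b is a character matrix, so convert its bounds before comparing
    if (between(as.numeric(b[i, 1]), a[i, 1], a[i, 2]) &&
        between(as.numeric(b[i, 2]), a[i, 1], a[i, 2])) {
      res = rbind(res, c(a[i, ], b[i, 3]))
    } else {
      res = rbind(res, c(a[i, ], NA))
    }
  }
}
res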
#> res
#     [,1]   [,2]   [,3]
#[1,] "25"   "85"   "geneX"
#[2,] "2000" "3000" NA
#[3,] "345"  "2300" "geneH"
I think you can drop the as.numeric() calls in your case, because your data seem to be numeric already.
Upvotes: 2
Reputation: 388982
You can use the fuzzyjoin package:
library(dplyr)  # provides the %>% pipe
fuzzyjoin::fuzzy_left_join(df1, df2,
                           by = c('V1', 'V2', 'V3'),
                           match_fun = c(`==`, `<=`, `>=`)) %>%
  dplyr::select(V1 = V1.x, V2 = V2.x, V3 = V3.x, V4)
#  V1   V2   V3    V4
#1  1   25   85 geneX
#2  1 2000 3000  <NA>
#3  2  345 2300 geneH
Upvotes: 0