Reputation: 1112
I have a dataframe df1:
df1 <- read.table(text=" Chr06 79641
Chr06 82862
Chr06 387314
Chr06 656098
Chr06 678491
Chr06 1018696", header=FALSE, stringsAsFactors=FALSE)
I would like to check if each row in df1 is contained in a range in df2. the column2 in df2 is the start of a range, and column3 is the end of a range. no overlapping between ranges (between rows). The data in df2 are sorted by Column1 and column2. I wrote a loop for this but I am not happy to it because It runs so long time if I have a few thousands rows in df1. I would like to find a more efficient way to do this job (better no looping). Thanks. The df2 data frame:
df2 <- read.table(text=" Chr05 799 870
Chr06 77914 77942
Chr06 78233 78269
Chr06 78719 78836
Chr06 79720 87043
Chr06 87223 87305
Chr06 380020 380060
Chr06 387314 387371
Chr06 654907 654988
Chr06 657929 658057
Chr06 677198 677229
Chr06 679555 680170
Chr06 1015425 1015475
Chr06 1018676 1018736
Chr06 1020564 1020592", header=FALSE, stringsAsFactors=FALSE)
My script:
df1$V3 <- FALSE
for (i in 1:dim(df1)[1]) {
for (j in 1:dim(df2)[1]) {
if ((df1[i,1] == df2[j,1]) && (df1[i,2]>=df2[j,2])
&& (df1[i,2]<=df2[j,3])) {
df1[i,3]<-TRUE
break;
}
}
}
df1
The expected result is shown as df1.
Upvotes: 2
Views: 173
Reputation: 56004
Using GenomicRanges:
#Convert to Granges objects
gr1 <- GRanges(seqnames = df1$V1,
ranges = IRanges(df1$V2, df1$V2))
gr2 <- GRanges(seqnames = df2$V1,
ranges = IRanges(df2$V2, df2$V3))
#Subset gr1
subsetByOverlaps(gr1, gr2)
# GRanges object with 3 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] Chr06 [ 82862, 82862] *
# [2] Chr06 [ 387314, 387314] *
# [3] Chr06 [1018696, 1018696] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
#Or we can use merge
mergeByOverlaps(gr1, gr2)
# DataFrame with 3 rows and 2 columns
# gr1 gr2
# <GRanges> <GRanges>
# 1 Chr06:*:[ 82862, 82862] Chr06:*:[ 79720, 87043]
# 2 Chr06:*:[ 387314, 387314] Chr06:*:[ 387314, 387371]
# 3 Chr06:*:[1018696, 1018696] Chr06:*:[1018676, 1018736]
Also, look into bedtools:
Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
Upvotes: 6
Reputation: 54237
Here is a data.table
solution as an alternative to GenomicRanges
:
library(data.table)
dt1 <- data.table(df1)[, V3 := V2]
dt2 <- data.table(df2, key = c("V2", "V3"))
foverlaps(dt1, dt2)[V1 == i.V1][, -c(4, 6), with = F]
# V1 V2 V3 i.V3
# 1: Chr06 79720 87043 82862
# 2: Chr06 387314 387371 387314
# 3: Chr06 1018676 1018736 1018696
Upvotes: 6
Reputation: 24945
You can do this using sapply
:
sapply(1:nrow(df1), function(x) any(df1[x,2] >= df2$V2 &
df1[x,2] <= df2$V3 &
df1[x, 1] == df2$V1))
[1] FALSE TRUE TRUE FALSE FALSE TRUE
Upvotes: 1