Reputation: 8626
I have one table with coordinates (start
, end
) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match with the former fragments. I.e., for each record from dtCoords
table I need to search a record in dtFrags
table having the same chr
and start
<=coord
<=end
(and retrieve the type
from this record of dtFrags
). Is it good idea at all to use R for this, or I should rather look to other languages?
Here is my example:
require(data.table)
dtFrags <- fread(
"id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
")
dtCoords <- fread(
"id,chr,coord
10,1,150
20,2,300
30,Y,500
")
At the end, I would like to have something like this:
"idC,chr,coord,idF,type
10, 1, 150, 1, exon
20, 2, 300, 2, intron
20, 2, 300, 4, exon
30, Y, 500, NA, NA
"
I can simplify a bit the task by splitting the table to subtables by chr
, so I would concentrate only on coordinates
setkey(dtCoords, 'chr')
setkey(dtFrags, 'chr')
for (chr in unique(dtCoords$chr)) {
dtCoordsSub <- dtCoords[chr];
dtFragsSub <- dtFrags[chr];
dtCoordsSub[, {
# ????
}, by=id]
}
but it's still not clear for me how should I work inside... I would be very grateful for any hints.
UPD. just in case, I put my real table in the archive here. After unpacking to your working directory, tables can be loaded with the following code:
dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags <- fread("dtFrags.txt", sep="\t", header=TRUE)
Upvotes: 6
Views: 1779
Reputation: 15458
Does this work?
You can use merge
first and then subset
kk<-merge(dtFrags,dtCoords,by="chr",all.x=TRUE)
> kk
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 2 300 500 intron 20 300
3: 2 4 250 600 exon 20 300
4: X 3 400 600 intron NA NA
kk[coord>=start & coord<=end]
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 4 250 600 exon 20 300
Upvotes: 3
Reputation: 118779
In general, it's very appropriate to use the bioconductor package IRanges
to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges
is another package that builds on top of IRanges
, specifically for handling, well, "Genomic Ranges".
require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]
grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500
Upvotes: 7