Finding sub intervals from interval data frames

Question

I have two data.frames with coordinates of linear intervals, which correspond to ids. Each id has several linear intervals. One of the data.frames is called exon.df:

exon.df <- data.frame(id=c(rep("id1",4),rep("id2",3),rep("id3",5)),
                      start=c(10,20,30,40,100,200,300,1000,2000,3000,4000,5000),
                      end=c(15,25,35,45,150,250,350,1500,2500,3500,4500,5500))

And the other cds.df:

cds.df <- data.frame(id=c(rep("id1",3),rep("id2",3),rep("id3",3)),
                      start=c(20,30,40,125,200,300,2250,3000,4000),
                      end=c(25,35,45,150,250,325,2500,3500,4250))

They both have the same ids, but the intervals of cds.df are contained within those of exon.df. The intervals in exon.df are exons of genes (parts of the genome which are copied and stitched together to make a transcript of the gene), and those in cds.df are the parts of these exons that will be translated to protein, since exons of the gene transcript also contain parts that will not be translated (untranslated regions, utrs). These utrs can only be located at the start and end of the gene transcript. The utr at the start is called the 5'utr and the utr at the end is called the 3'utr. A utr may either not exist at all, or span anywhere from part of a single exon to several exons at each end of the gene.

This means that the 5'utr of an id runs from the first position of its first interval in exon.df to one position before its first interval in cds.df, and includes all the exons in exon.df in between, if any exist. Similarly, the 3'utr of an id runs from one position after its last interval in cds.df to the last position of its last interval in exon.df, and includes all the exons in exon.df in between, if any exist. It's also possible that an id will lack either or both utrs: it has no 5'utr if the first position of its first interval in cds.df is the first position of its first interval in exon.df, and no 3'utr if the last position of its last interval in cds.df is the last position of its last interval in exon.df.

I'm looking for a fast way to retrieve these 5'utr and 3'utr intervals given exon.df and cds.df.

Here's what the outcome for this example should be:

utr5.df <- data.frame(id=c("id1","id2","id3","id3"),
                     start=c(10,100,1000,2000),
                     end=c(15,124,1500,2249))

utr3.df <- data.frame(id=c("id2","id3","id3"),
                     start=c(326,4251,5000),
                     end=c(350,4500,5500))

Alexander Engelhardt · Accepted Answer

Do you know about Bioconductor? It's an add-on for R, specifically for the biosciences. It has a package called GenomicRanges, with which you can create a GRanges object that contains all Exons, and another object that contains all CDSs.

You can then do a set difference of these two objects to get the UTRs. See the "setops-methods" section of the GenomicRanges documentation. You want the 'setdiff' function.

So: Transform your data.frames into GRanges objects, then issue something like utrs <- setdiff(exons, cds)
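
As a rough sketch of that suggestion, the code below builds the two GRanges objects from the question's data, takes their set difference, and then splits the result into 5' and 3' parts. Two things here are my own assumptions rather than part of the question: each id is used as the "sequence name" so that setdiff works per id, and "first"/"last" are taken to mean lowest/highest coordinate (i.e. every id is on the plus strand). The helper names cds.first and is.utr5 are only illustrative.

```r
library(GenomicRanges)  # Bioconductor package; must be installed separately

exon.df <- data.frame(id=c(rep("id1",4),rep("id2",3),rep("id3",5)),
                      start=c(10,20,30,40,100,200,300,1000,2000,3000,4000,5000),
                      end=c(15,25,35,45,150,250,350,1500,2500,3500,4500,5500))
cds.df <- data.frame(id=c(rep("id1",3),rep("id2",3),rep("id3",3)),
                     start=c(20,30,40,125,200,300,2250,3000,4000),
                     end=c(25,35,45,150,250,325,2500,3500,4250))

# Assumption: treat each id as its own "sequence" so ranges of
# different ids are never compared with each other.
exons <- GRanges(seqnames = exon.df$id,
                 ranges = IRanges(start = exon.df$start, end = exon.df$end))
cds <- GRanges(seqnames = cds.df$id,
               ranges = IRanges(start = cds.df$start, end = cds.df$end))

# Exonic positions not covered by any CDS interval:
# all utr intervals, 5' and 3' still mixed together.
utrs <- setdiff(exons, cds)
utr.df <- data.frame(id = as.character(seqnames(utrs)),
                     start = start(utrs),
                     end = end(utrs))

# Assumption: plus strand, so a utr interval that starts before
# the id's first CDS position is 5', anything else is 3'.
cds.first <- tapply(cds.df$start, cds.df$id, min)
is.utr5 <- utr.df$start < as.vector(cds.first[utr.df$id])

utr5.df <- utr.df[is.utr5, ]
utr3.df <- utr.df[!is.utr5, ]
```

For minus-strand genes the 5'/3' labels would be swapped, so real data would also need a strand column taken into account.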
