Cisco
Cisco

Reputation: 137

Compare and obtain intervals intersections between rows

I have a data base like the following.

pos1<-c(5,15,25,40,80,5,18,22,38,84,5,16,50,92,31,50,20,30,50,70,27,50,60,50,90,20,40)
pos2<-c(10,17,30,42,90,10,20,24,42,87,10,19,52,100,40,70,25,32,60,90,30,60,71,60,100,25,50)
chr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)
n<-c(25,65,78,56,35,78,58,98,14,25,65,85,98,74,20,36,48,98,52,69,21,47,53,10,12,37,82)
pop<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","A","A","A","A","B","B","B","C","C","D","D")
data<-data.frame(pos1,pos2,chr,pop,n)

Position 1 and position 2 designed the start and end point of an interval for each chr and population. My intention is to obtain which interval intersects between pops A, B and C (not D) and which intervals are unique for each population.

So, for the unique intervals I would have an outcome data.frame like the following:

pos1.u<-c(25,50,92,20,30,27,90)
pos2.u<-c(30,52,100,25,32,30,100)
chr.u<-c(1,1,1,2,2,2,2)
pop.u<-c("A","B","C","A","A","B","C")
n.u<-c(78,98,74,48,98,21,12)
data.u<-data.frame(pos1.u,pos2.u,chr.u,pop.u,n.u)

And for the intervals that intersects between those 3 populations a data.frame like the following:

pos1.c<-c(5,15,40,80,5,38,85,5,16,50,70,50,60,50)
pos2.c<-c(10,17,42,90,10,42,87,10,19,60,90,60,71,60)
chr.c<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2)
pop.c<-c("A","A","A","A","B","B","B","C","C","A","A","B","B","C")
n.c<-c(25,65,56,35,78,14,25,65,85,52,69,47,53,10)
data.c<-data.frame(pos1.c,pos2.c,chr.c,pop.c,n.c)

I don't know how to write a script that does precisely this, can you help me?

Upvotes: 2

Views: 122

Answers (1)

Andrew Gustar
Andrew Gustar

Reputation: 18425

I think the following code does what you ask for, although it produces different results from yours - so please check it carefully! The discrepancy I think lies in the definition of open and closed intervals. The following assumes that neither end point is included, whereas I suspect this might not be what you mean (otherwise (15,18) and (17,19) would not count as overlapping, as there is no integer value that falls in both). So you might need to adjust the open/closed definitions below.

pos1<-c(5,15,25,40,80,5,18,22,38,84,5,16,50,92,31,50,20,30,50,70,27,50,60,50,90,20,40)
pos2<-c(10,17,30,42,90,10,20,24,42,87,10,19,52,100,40,70,25,32,60,90,30,60,71,60,100,25,50)
chr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)
n<-c(25,65,78,56,35,78,58,98,14,25,65,85,98,74,20,36,48,98,52,69,21,47,53,10,12,37,82)
pop<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","A","A","A","A","B","B","B","C","C","D","D")
data<-data.frame(pos1,pos2,chr,pop,n,stringsAsFactors = FALSE)

library(intervals)
data<-data[data$pop!="D",] #remove irrelevant D entries
rownames(data) <- seq_len(nrow(data)) #reset rownames to allow for removed Ds

#set ints as a list of intervals (as required by intervals package)
ints <- tapply(1:nrow(data),data$pop,function(v) 
         Intervals(as.matrix(data[v,c("pos1","pos2")]),
         closed=c(FALSE,FALSE), #this is where you adjust open/closed lower and upper ends of the intervals - TRUE means end value included
         type="Z")) #Z is integers
pops <- unique(data$pop) #unique values of pop
popidx <- lapply(pops,function(x) which(data$pop==x)) #list of indices of these values in data
names(popidx) <- pops

#sets is a df of all pairwise combinations to check
sets <- expand.grid(pops,pops,stringsAsFactors = FALSE) 
sets <- sets[sets$Var1!=sets$Var2,]

olap <- lapply(1:nrow(sets),function(i) 
        interval_overlap(ints[[sets$Var1[i]]],ints[[sets$Var2[i]]])) #list of overlaps
olap <- lapply(1:nrow(sets),function(i) {
  df<-as.data.frame(olap[[i]],stringsAsFactors=FALSE)
  df$pos1 <- as.numeric(rownames(df))
  df$pos2 <- sapply(1:nrow(df),function(j) popidx[[sets$Var2[i]]][df[j,1][[1]][1]])
  return(df)}) #tidy up as dfs, with correct indices in data (rather than in ints)
olap <- do.call(rbind,olap)[,-1] #join dataframes
olap$olaps <- !is.na(olap$pos2) #identify those with overlaps

#group by unique pos1 and identify max and min no of overlaps with other groups
olap <- data.frame(minoverlap=tapply(olap$olaps,olap$pos1,min),maxoverlap=tapply(olap$olaps,olap$pos1,max))
olap$rowno <- as.numeric(rownames(olap))

uniques <- data[olap$rowno[olap$maxoverlap==0],] #intervals appearing in just one pop
commons <- data[olap$rowno[olap$minoverlap>0],] #intervals with an overlap in all other pops

Upvotes: 1

Related Questions