Reputation: 35
I have a two big files with Start
and End
positions and two sample columns (numeric).
File 1:
Start End Sample1 Sample2
1 60 1 4
100 200 2 1
201 250 1 4
300 450 1 1
File 2:
Start End Sample1 Sample2
40 60 1 1
70 180 1 1
240 330 2 1
340 450 1 4
500 900 1 4
980 1200 2 1
First, I would like to take the first Start
and End
positions from first file and make a segment plot. The plot must also take into account Start-20
and End+20
for each position in first file.
Then I would like to take the overlapping Start
and End
positions from the second file and plot it on the above plot. In this way there will be many plots based on the Start
and End
positions from the first file and the one with out overlaps will also be plotted individually.
The color
for each segment will be based on the two sample numbers (for example in both the files if its 1 and 4
the color of segment will be red
, if its 1 and 1
the color of segment will be green
and so on).
I would really appreciate if someone make me understand how to make function for this in R.
Thanks in advance.
PS I have attached the drawing for an output. I have shown only two results.
Below is the code which I wrote but it gives an error
Error in match.names(clabs, names(xi)) : names do not match previous names
Also I need to specify red color to dataset1 line segment and green color to dataset2 line segments. How will I implement it in the code below?
overlap_func <- function(dataset1,dataset2) {
for(i in 1:nrow(dataset1))
{
loop_start <- dataset1[i,"Start"]
loop_end <- dataset1[i,"End"]
p <- dataset2[,c(1,2)]
dataset1_pos <- data.frame(loop_start,loop_end)
dataset2_filter <- p[p$Start >= (loop_start-(loop_start/2)) & p$End <= (loop_end+ (loop_end/2)), ]
data_in_loop <- rbind(dataset1_pos,dataset2_filter)
plot_function(data_in_loop,loop_start,loop_end)
}
}
plot_function <- function(loop_data,start,end){
pos <- 1:nrow(loop_data)
dat1 <- cbind(pos,loop_data)
colnames(dat1) <- c("pos","start","end")
pdf(file=paste0("path where plots are generated","_",start,"-",end,"_","overlap.pdf"))
plot(dat1$pos, type = 'n', xlim = range(c(start-(start/2), end+(end/2))))
segments(dat1$start, dat1$pos, dat1$end, dat1$pos)
dev.off()
}
df1 <- read.table(header=T, text="Start End Sample1 Sample2
1 60 1 4
100 200 2 1
201 250 1 4
300 450 1 1")
df2 <- read.table(header=T, text="Start End Sample1 Sample2
40 60 1 1
70 180 1 1
240 330 2 1
340 450 1 4
500 900 1 4
980 1200 2 1")
overlap_func(df1,df2)
Upvotes: 3
Views: 901
Reputation: 118839
Something like this??
df1 <- read.table(header=T, text="Start End Sample1 Sample2
1 60 1 4
100 200 2 1
201 250 1 4
300 450 1 1")
df2 <- read.table(header=T, text="Start End Sample1 Sample2
40 60 1 1
70 180 1 1
240 330 2 1
340 450 1 4
500 900 1 4
980 1200 2 1")
require(IRanges)
require(ggplot2)
require(plyr)
df1$id <- factor(1:nrow(df1))
ir2 <- IRanges(df2$Start, df2$End)
out <- ddply(df1, .(id), function(x) {
ir1 <- IRanges(x$Start, x$End)
o.idx <- as.data.frame(findOverlaps(ir1, ir2))$subjectHits
df.out <- rbind(x[, 1:4], df2[o.idx, ])
df.out$id1 <- x$id
df.out$id2 <- seq_len(nrow(df.out))
df.out
})
out$id1 <- factor(out$id1)
out$id2 <- factor(out$id2)
out$id3 <- factor(1:nrow(out))
p <- ggplot(out, aes(x = Start, y = id3 , colour = id2))
p <- p + geom_segment(aes(xend = End, ystart = id3, yend = id3))
p <- p + scale_colour_brewer(palette = "Set1")
p
Edit: After looking at your updated drawing, maybe this is what you want?
p + facet_wrap( ~ id1, scales="free")
Edit: Save each plot in facet in separate file. You can do this by generating the plot each time by splitting on id1
d_ply(out, .(id1), function(ww) {
p <- ggplot(ww, aes(x = Start, y = id3 , colour = id2))
p <- p + geom_segment(aes(xend = End, ystart = id3, yend = id3))
p <- p + scale_colour_brewer(palette = "Set1")
fn <- paste0("~/Downloads/id", as.numeric(as.character(ww$id1[1])), ".pdf")
ggsave(fn, p)
})
Set the path in fn
accordingly.
Upvotes: 2
Reputation: 121588
I tried to solve this problem using lattice
package. I specially use function Shingle
to get an idea of the intervals to compare. I hoped I can merge the 2 shingles but I can't. So, once I have a my first plot, I used ( as in the above solution) IRanges
package to compute the overlaps. The idea is have a final dotplot
.
## I red the input data
dat <- read.table(text = 'Start End Sample1 Sample2
1 60 1 4
100 200 2 1
201 250 1 4
300 450 1 1', header = T)
dat1 <- read.table(text = 'Start End Sample1 Sample2
40 60 1 1
70 180 1 1
240 330 2 1
340 450 1 4
500 900 1 4
980 1200 2 1', header = T)
## I create my 2 shingles
dat.sh <- shingle(x = dat[,3], intervals = dat[,c(1,2)])
dat1.sh <- shingle(x = dat1[,3], intervals = dat1[,c(1,2)])
## compute max value for plot comparison
max.value <- max(c(dat$End,dat1$End))
## I plot the 2 series with differents color
p1<- plot(dat.sh, xlim= c(0,max.value),col = 'red')
p2 <- plot(dat1.sh,xlim= c(0,max.value), col ='green')
library(gridExtra)
grid.arrange(p1,p2)
This is a fast method to compare my intervals.
This looks Ok, but I can't go further with shingle since I can't merge them in the same plot. SO I will use IRanges
package to compute the overlaps.
library(IRanges)
rang1 <- IRanges(start=dat[,1], end = dat[,2])
rang2 <- IRanges(start=dat1[,1], end = dat1[,2])
dat.plot <- dat1 # use the first data.frame
dat.plot$group <- 'origin'
dat.plot$id <- rownames(dat1) ## add an Id for each row
rang.o <- findOverlaps(rang2,rang1) # get overlaps
dat.o <- dat1[rang.o@queryHits,] ## construct overlaps data.frame
dat.o$id <- rang.o@subjectHits
dat.o$group <- 'overlap'
dat.plot <- rbind(dat.plot,dat.o) ## union of all
dotplot(id ~End-Start|group , data=dat.plot,
groups = col,type = c("p", "h"))
Upvotes: 1