Reputation: 157
I am trying to write a rscript that will find the annotation for ChIP-seq peaks using biomaRt from the bioconductor package. I am adapting the annotation code from here, https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/spp-r-from-chip-seq I need to find TSS sites within 2.5 to 5kb of the binding peak. Unlike in the example website, I have to do it across the entire genome.
I know that the code works for annotation - currently, I have the code block duplicated 22 times instead of with the loop.
I also need to find away to keep the script from exiting if there are no peaks on the chromosome that is being iterated over.
#!/usr/bin/Rscript --vanilla --slave
# Change to data directory
setwd("/data/met/bowtie_out/tAfiles/MLE15/2/");
# send the output to a file AND the Terminal
sink("09June2013_spp.txt", append=FALSE, split=TRUE);
# Load Libraries
library(biomaRt);
library(plyr);
load("MLE15_pooled_2.Rdata");
# y equals the SPP score. I have to truncate it after IDR analysis.
bp <- llply(region.peaks$npl, subset, y > 12.0633)
print(paste("After filtering",sum(unlist(lapply(bp,function(d) length(d$x)))),"peaks remain"));
save.image(file="09Jun13_1.RData");
# begin collecting annotation have to use mm10.
ensembl= useMart('ensembl', dataset='mmusculus_gene_ensembl', host="apr2013.archive.ensembl.org");
# need to make a for loop that will loop through all of the chromosomes and not error out if no peaks are on that chromosome.
# So for this 'for' loop in R do I need to make a list? like for (z in (c('1'....etc?
for ( z in ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 'X', 'Y', 'M' ) {
# To insert the variable which I am looping on with other variables, is there something I should add like the $ in bash for loops that is specific to R? can I put the variable in quotes - like for the input to biomaRt? ex. values= "z"?
genes.chrz = getBM(attributes = c("chromosome_name", "start_position", "end_position", "strand", "description", "entrezgene"), filters = "chromosome_name", values= "z", mart = ensembl);
overlap = function(bs, ts, l)
{
if ((bs > ts - l) && (bs < ts + l)) {
TRUE;
} else {
FALSE;
}
}
fivePrimeGenes = function(bs, ts, te, s, l, n, c, d)
{
fivePrimeVec = logical();
for (i in 1:length(ts)) {
fivePrime = FALSE;
for (j in 1:length(bs)) {
if (s[i] == 1) {
fivePrime = fivePrime || overlap(bs[j], ts[i], l);
} else {
fivePrime = fivePrime || overlap(bs[j], te[i], l);
}
}
fivePrimeVec = c(fivePrimeVec, fivePrime);
}
fivePrimeVec;
}
fivePrimeGenesLogical = fivePrimeGenes(bp$chrz$x, genes.chrz$start_position, genes.chrz$end_position, genes.chrz$strand, 5000, genes.chrz$entrezgene, genes.chrz$chromosome_name, genes.chrz$description);
fivePrimeStartsPlus = genes.chrz$start_position[fivePrimeGenesLogical & genes.chrz$strand == 1]
fivePrimeStartsMinus = genes.chrz$end_position[fivePrimeGenesLogical & genes.chrz$strand == -1]
fivePrimeStarts = sort(c(fivePrimeStartsPlus, fivePrimeStartsMinus));
entrezgene <- data.frame(genes.chrz$entrezgene[fivePrimeGenesLogical]);
chromosome_name <- data.frame(genes.chrz$chromosome_name[fivePrimeGenesLogical]);
start_pos <- data.frame(genes.chrz$start_position[fivePrimeGenesLogical]);
end_pos <- data.frame(genes.chrz$end_position[fivePrimeGenesLogical]);
strand <- data.frame(genes.chrz$strand[fivePrimeGenesLogical]);
description <- data.frame(genes.chrz$description[fivePrimeGenesLogical]);
AnnotationData <- cbind(chromosome_name, entrezgene, start_pos, end_pos, strand, description);
write.table(AnnotationData, file="chrz_annotation_data.csv", row.names=FALSE, col.names=FALSE, sep="\t");
}
save.image(file="09Jun13_2.RData");
# close the output file
sink()
# clean all
rm(list=ls(all=TRUE));
quit("yes");
Sorry to make such a messy question, but I figure that having the lines out there will help other people who are trying to annotate their peak files. Any help or advice is greatly appreciated!
Thanks!
Upvotes: 0
Views: 2540
Reputation: 46876
Let's get the annotations; no need to loop over chromosomes at this stage
library(biomaRt)
ensembl <- useMart('ENSEMBL_MART_ENSEMBL', dataset='mmusculus_gene_ensembl',
host="apr2013.archive.ensembl.org")
genes.chrz <- getBM(attributes = c("chromosome_name", "start_position",
"end_position", "strand", "description", "entrezgene"),
mart = ensembl)
It'll be convenient to use functions from the GenomicRanges package, and in particular to represent our annotations (and our data) as a GRanges
object.
library(GenomicRanges)
genes <- with(genes.chrz, {
GRanges(chromosome_name, IRanges(start_position, end_position), strand,
entrezgene=entrezgene, description=description)
})
Looks like you're interested in the 5000 nt flanking (5' of) the start of each gene
flanks <- flank(genes, 5000)
Looks like you have some peak data that is in a data frame that you could make into a GRanges
object; it's too hard to tell from what you've provided. Then it's easy to, e.g., count the number of peaks overlapping each flanking region or subset the peaks to contain just those that overlap flanking regions.
gr <- with(df, GRanges(chrom, IRanges(start, end), strand))
countOverlaps(flanks, gr)
subsetByOverlaps(gr, flanks)
There are extensive vignettes available from the GenomicRanges and IRanges landing pages; the Bioconductor mailing list is a helpful resource.
But if one really wants to work with the code Matt provides... I started by zapping all those ';' (not needed in R) and wrapping the code to 80 columns for readability. Then I pulled the constant function definitions outside the loop and revised overlap
to operate on a vector ts
, rather than a scalar
overlap = function(bs, ts, l)
{
## suppose ts is a vector, bs and l scalars, then use single '&';
## result is a logical(length(ts)) with appropriate values TRUE or
## FALSE
(bs > ts - l) & (bs < ts + l)
}
for fivePrimeGenes
I tried to exploit the vectorized overlap
-- I don't have to iterate over ts
, just invoke overlap
once for each bs[j]
. There are more efficient ways of doing this... I used seq_along
so that I don't get caught by doing 1:0
when bs
is of length 0, and removed arguments not used in the function
fivePrimeGenes = function(bs, ts, te, s, l)
{
## use 'pre-allocate and fill' rather than copy-and-append
fivePrimeVec <- logical(length(ts)) # initially all FALSE
## choose strand for all elements of ts, te;
se <- ifelse(s == 1, ts, te)
## overlap is vectorized, no need to iterate over elements of se
## use seq_along() to avoid edge case of 0-length ts
for (j in seq_along(bs))
## calculate overlaps for each element of bs
fivePrimeVec <- fivePrimeVec | overlap(bs[j], se, l)
fivePrimeVec
}
OK, for the main loop, let R create the sequence 1:19, and coerce to character. Use the variable z
rather than the string "z" when querying biomaRt, and when creating a file name to write data to. Use of with
is a convenience / evil; it helped to tidy the code in this case so I chose to use it. fivePrimeStarts
was not used in subsequent calculations, so I removed it. Seemed like AnnotationData could be created as a subset of the existing data frame, rather than error-prone assembly from subset parts.
for ( z in c( 1:19, "X", "Y", "M" ) ) {
## use the variable z, rather than the string "z"; order
## attributes as wanted in final representation
attributes <- c("entrezgene", "chromosome_name", "start_position",
"end_position", "strand", "description",
"entrezgene")
genes.chrz = getBM(attributes = attributes,
filters = "chromosome_name", values=z, mart = ensembl)
## use 'with' to simplify common column selection
fivePrimeGenesLogical = with(genes.chrz, {
fivePrimeGenes(bp$chrz$x, start_position, end_position, strand, 5000)
})
## AnnotationData as subset, rather than assembling
AnnotationData <- genes.chrz[fivePrimeGenesLogical,, drop=FALSE]
## as a matrix? as.matrix(AnnotationData)
## use 'sprintf' to create a file name
write.table(AnnotationData,
file=sprintf("chr%s_annotation_data.csv", z),
row.names=FALSE, col.names=FALSE, sep="\t")
}
I hope that helps; probably I introduced a lot of errors!
Upvotes: 4