Matt Thornton
Matt Thornton

Reputation: 157

Using a for loop in R with a condition

I am trying to write a rscript that will find the annotation for ChIP-seq peaks using biomaRt from the bioconductor package. I am adapting the annotation code from here, https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/spp-r-from-chip-seq I need to find TSS sites within 2.5 to 5kb of the binding peak. Unlike in the example website, I have to do it across the entire genome.

I know that the code works for annotation - currently, I have the code block duplicated 22 times instead of with the loop.

I also need to find away to keep the script from exiting if there are no peaks on the chromosome that is being iterated over.

#!/usr/bin/Rscript --vanilla --slave

# Change to data directory
setwd("/data/met/bowtie_out/tAfiles/MLE15/2/");

# send the output to a file AND the Terminal
sink("09June2013_spp.txt", append=FALSE, split=TRUE);

# Load Libraries
library(biomaRt);
library(plyr);

load("MLE15_pooled_2.Rdata");

# y equals the SPP score. I have to truncate it after IDR analysis.

bp <- llply(region.peaks$npl, subset, y > 12.0633)

print(paste("After filtering",sum(unlist(lapply(bp,function(d) length(d$x)))),"peaks remain"));

save.image(file="09Jun13_1.RData");

# begin collecting annotation have to use mm10.

ensembl= useMart('ensembl', dataset='mmusculus_gene_ensembl', host="apr2013.archive.ensembl.org");

# need to make a for loop that will loop through all of the chromosomes and not error out if no peaks are on that chromosome.

# So for this 'for' loop in R do I need to make a list? like for (z in (c('1'....etc?

for ( z in ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 'X', 'Y', 'M' ) {

# To insert the variable which I am looping on with other variables, is there something I should add like the $ in bash for loops that is specific to R? can I put the variable in quotes - like for the input to biomaRt? ex. values= "z"?

genes.chrz = getBM(attributes = c("chromosome_name", "start_position", "end_position", "strand", "description", "entrezgene"), filters = "chromosome_name", values= "z", mart = ensembl);

overlap = function(bs, ts, l)
{
    if ((bs > ts - l) && (bs < ts + l)) {
        TRUE;
    } else {
        FALSE;
    }
}

fivePrimeGenes = function(bs, ts, te, s, l, n, c, d)
{
    fivePrimeVec = logical();
    for (i in 1:length(ts)) {
            fivePrime = FALSE;
            for (j in 1:length(bs)) {
                if (s[i] == 1) {
                    fivePrime = fivePrime || overlap(bs[j], ts[i], l);
                } else {
                    fivePrime = fivePrime || overlap(bs[j], te[i], l);
                }
             }
            fivePrimeVec = c(fivePrimeVec, fivePrime);
    }
     fivePrimeVec;
}

fivePrimeGenesLogical = fivePrimeGenes(bp$chrz$x, genes.chrz$start_position, genes.chrz$end_position, genes.chrz$strand, 5000, genes.chrz$entrezgene, genes.chrz$chromosome_name, genes.chrz$description);
fivePrimeStartsPlus = genes.chrz$start_position[fivePrimeGenesLogical & genes.chrz$strand == 1]
fivePrimeStartsMinus = genes.chrz$end_position[fivePrimeGenesLogical & genes.chrz$strand == -1]
fivePrimeStarts = sort(c(fivePrimeStartsPlus, fivePrimeStartsMinus));

entrezgene <- data.frame(genes.chrz$entrezgene[fivePrimeGenesLogical]);
chromosome_name <- data.frame(genes.chrz$chromosome_name[fivePrimeGenesLogical]);
start_pos <- data.frame(genes.chrz$start_position[fivePrimeGenesLogical]);
end_pos <- data.frame(genes.chrz$end_position[fivePrimeGenesLogical]);
strand <- data.frame(genes.chrz$strand[fivePrimeGenesLogical]);
description <- data.frame(genes.chrz$description[fivePrimeGenesLogical]);


AnnotationData <- cbind(chromosome_name, entrezgene, start_pos, end_pos, strand, description);
write.table(AnnotationData, file="chrz_annotation_data.csv", row.names=FALSE, col.names=FALSE, sep="\t");

}

save.image(file="09Jun13_2.RData");

# close the output file
sink()

# clean all
rm(list=ls(all=TRUE));

quit("yes");

Sorry to make such a messy question, but I figure that having the lines out there will help other people who are trying to annotate their peak files. Any help or advice is greatly appreciated!

Thanks!

Upvotes: 0

Views: 2540

Answers (1)

Martin Morgan
Martin Morgan

Reputation: 46876

Let's get the annotations; no need to loop over chromosomes at this stage

library(biomaRt)
ensembl <- useMart('ENSEMBL_MART_ENSEMBL', dataset='mmusculus_gene_ensembl',
                   host="apr2013.archive.ensembl.org")
genes.chrz <- getBM(attributes = c("chromosome_name", "start_position",
                      "end_position", "strand", "description", "entrezgene"),
                    mart = ensembl)

It'll be convenient to use functions from the GenomicRanges package, and in particular to represent our annotations (and our data) as a GRanges object.

library(GenomicRanges)
genes <- with(genes.chrz, {
    GRanges(chromosome_name, IRanges(start_position, end_position), strand, 
            entrezgene=entrezgene, description=description)
})

Looks like you're interested in the 5000 nt flanking (5' of) the start of each gene

flanks <- flank(genes, 5000)

Looks like you have some peak data that is in a data frame that you could make into a GRanges object; it's too hard to tell from what you've provided. Then it's easy to, e.g., count the number of peaks overlapping each flanking region or subset the peaks to contain just those that overlap flanking regions.

gr <- with(df, GRanges(chrom, IRanges(start, end), strand))
countOverlaps(flanks, gr)
subsetByOverlaps(gr, flanks)

There are extensive vignettes available from the GenomicRanges and IRanges landing pages; the Bioconductor mailing list is a helpful resource.

But if one really wants to work with the code Matt provides... I started by zapping all those ';' (not needed in R) and wrapping the code to 80 columns for readability. Then I pulled the constant function definitions outside the loop and revised overlap to operate on a vector ts, rather than a scalar

overlap = function(bs, ts, l)
{
    ## suppose ts is a vector, bs and l scalars, then use single '&';
    ## result is a logical(length(ts)) with appropriate values TRUE or
    ## FALSE
    (bs > ts - l) & (bs < ts + l)
}

for fivePrimeGenes I tried to exploit the vectorized overlap -- I don't have to iterate over ts, just invoke overlap once for each bs[j]. There are more efficient ways of doing this... I used seq_along so that I don't get caught by doing 1:0 when bs is of length 0, and removed arguments not used in the function

fivePrimeGenes = function(bs, ts, te, s, l)
{
    ## use 'pre-allocate and fill' rather than copy-and-append
    fivePrimeVec <- logical(length(ts)) # initially all FALSE
    ## choose strand for all elements of ts, te;
    se <- ifelse(s == 1, ts, te)
    ## overlap is vectorized, no need to iterate over elements of se
    ## use seq_along() to avoid edge case of 0-length ts
    for (j in seq_along(bs))
        ## calculate overlaps for each element of bs
        fivePrimeVec <- fivePrimeVec | overlap(bs[j], se, l)
    fivePrimeVec
}

OK, for the main loop, let R create the sequence 1:19, and coerce to character. Use the variable z rather than the string "z" when querying biomaRt, and when creating a file name to write data to. Use of with is a convenience / evil; it helped to tidy the code in this case so I chose to use it. fivePrimeStarts was not used in subsequent calculations, so I removed it. Seemed like AnnotationData could be created as a subset of the existing data frame, rather than error-prone assembly from subset parts.

for ( z in c( 1:19, "X", "Y", "M" ) )  {
    ## use the variable z, rather than the string "z"; order
    ## attributes as wanted in final representation
    attributes <- c("entrezgene", "chromosome_name", "start_position",
                    "end_position", "strand", "description",
                    "entrezgene")
    genes.chrz = getBM(attributes = attributes,
      filters = "chromosome_name", values=z, mart = ensembl)

    ## use 'with' to simplify common column selection
    fivePrimeGenesLogical = with(genes.chrz, {
        fivePrimeGenes(bp$chrz$x, start_position, end_position, strand, 5000)
    })

    ## AnnotationData as subset, rather than assembling
    AnnotationData <- genes.chrz[fivePrimeGenesLogical,, drop=FALSE]
    ## as a matrix? as.matrix(AnnotationData)

    ## use 'sprintf' to create a file name
    write.table(AnnotationData,
                file=sprintf("chr%s_annotation_data.csv", z),
                row.names=FALSE, col.names=FALSE, sep="\t")
}

I hope that helps; probably I introduced a lot of errors!

Upvotes: 4

Related Questions