mschilli
mschilli

Reputation: 1894

Subset SAM/BAM file in R

I have a BAM file with lots of reads. I can load it into R with scanBam from Rsamtools.

However, I only need a subset of reads. I have a character vector with the qnames I am interested in.

scanBam returns a list with 1 element which is a list with 13 elements which contain data for all the thousands of reads.

How can I subset this object by qname preserving the structure? I was not able to find anything in the manual or online.

Upvotes: 1

Views: 3278

Answers (2)

mschilli
mschilli

Reputation: 1894

I ended up using subset(DataFrame(scanBam(bam_file)[[1]]),qname %in% qname_vector). This does not keep the exact same structure (list of list) but all the information is kept and easily accessible.

Upvotes: 1

Martin Morgan
Martin Morgan

Reputation: 46856

It's probably more convenient to input the data with GenomicAlignments::readGAlignments, including the qname by specifying param=ScanBamParam(what="qname") as an argument. You could then subset with %in%. Here's a more complete example, using one of the ExperimentData packages

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)

fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]    
want <- c("ERR127306.11930623", "ERR127306.24720935",
    "ERR127306.23706011", "ERR127306.22418829", "ERR127306.13372247",
    "ERR127306.20686334", "ERR127306.11412145", "ERR127306.4711647",
    "ERR127306.7479582", "ERR127306.12737243")
aln <- readGAlignments(fname, param=ScanBamParam(what="qname"))
aln[mcols(aln)$qname %in% want]

BAM files are of course big, and the qnames are a big part of that; it often makes sense to iterate through the file in chunks. This is enabled (in the current Rsamtools) with yieldReduce, where one provides BamFile with yieldSize set to a reasonable (e.g., 1M) number of reads, a MAP function to input a chunk of data and process it (e.g., filtering the unwanted reads), an (optional) REDUCE function to concatenate results, and an (optional) DONE function to indicate when the iteration is done. The solution looks like (yieldSize is artificially small, to allow illustration with the sample data):

bfl <- BamFile(fname, yieldSize=100000)  ## larger, e.g., 1M-5M
MAP <- function(bfl, want) {
    ## message("tick")
    aln <- readGAlignments(bfl, param=ScanBamParam(what="qname"))
    if (length(aln) == 0)
        NA                          # semaphore -- DONE
    else
        aln[mcols(aln)$qname %in% want]
}
REDUCE <- c
DONE <- function(x) identical(x, NA)
result <- yieldReduce(bfl, MAP, REDUCE, DONE, want=want)

One could adopt a similar approach using scanBam, but the data structure (list-of-lists) is more complicated to deal with:

x <- scanBam(fname, param=ScanBamParam(what=c("qname", "pos")))
keep <- lapply(lapply(x, "[[", "qname"), "%in%", want)
result <- Map(function(elts, keep) {
    lapply(elts, "[", keep)
}, x, keep)

This could also be used with yieldReduce.

If you're interested in creating a new bam file with the filtered reads, then

filter_factory <- function(want) {
    list(KeepQname = function(x) x$qname %in% want)
}
filter <- FilterRules(filter_factory(want))
dest <- filterBam(fname, tempfile(), filter=filter,
                  param=ScanBamParam(what="qname"))
readGAlignments(dest)

Upvotes: 2

Related Questions