How to cache reads?

I am using python/pysam to do analyze sequencing data. In its tutorial (pysam - An interface for reading and writing SAM files) for the command mate it says:

'This method is too slow for high-throughput processing. If a read needs to be processed with its mate, work from a read name sorted file or, better, cache reads.'

How would you 'cache reads'?

Upvotes: 15

Answers (3)

bricoletc

Reputation: 554

I find the other answers do not address how to actually cache reads in practice.

Here is a simple way to do this:

from collections import defaultdict

from pysam import AlignmentFile

def get_mate(read_pairs, read):
    if read.qname not in read_pairs or not (read.is_read1 ^ read.is_read2):
      return None
    pos = 1 if read.is_read1 else 0
    return read_pairs[read.qname][pos]

# maps QNAME to a read pair
read_pairs = defaultdict(lambda : [None, None])

fin = AlignmentFile("your_filepath")

for read in fin.fetch(your_chrom,your_start,your_stop):
    if read.is_paired and (read.is_read1 ^ read.is_read2):
        pos = 0 if read.is_read1 else 1
        read_pairs[read.qname][pos] = read

## Now compare execution time of these two commands
your_read_mate = fin.mate(your_read) # pysam, non-cached
your_read_mate = get_mate(read_pairs, your_read) # cached

In which the operational definition for a read pair is that (c.f. SAM format):

The two reads have the same QNAME
Each read has flag 0x1 set (read.is_paired)
Each read has only one of flag 0x40 (read.is_read1) or 0x80 (read.is_read2) set (the XOR read.is_read1 ^ read.is_read2 checks for this)

On my machine, using ipython's %timeit command, I get 18.9 ms ± 510 µs for the non-cached call and 854 ns ± 28.7 ns for the cached call for a given read (for which I know the pair is in read_pairs) :-)

Upvotes: 0

pixelbrei

Reputation: 418

AlignmentFile takes as the first argument:

filepath_or_object

So instead of supplying a filename, you can supply an object that supports a file-like interface, i.e. the methods seek, read, tell. When implementing a class for this, you can also implement caching on the reads, which of course have to depend on current cursor position.

If filesize is small enough so that it fits into memory, you can read the complete file and operate on an io.BytesIO object, no need to make your own class:

data = io.BytesIO(open('datafile','rb').read())
your_object = AlignmentFile(data, <other args>)

I am not sure that this will speed things up much, because i assume that modern operating systems (i know linux will do that) do cache file access. So maybe it is enough to rely on that.

Upvotes: 2

noxdafox

Reputation: 15060

Caching is a typical approach to speed up long running operations. It sacrifices memory for the sake of computational speed.

Let's suppose you have a function which given a set of parameters always returns the same result. Unfortunately this function is very slow and you need to call it a considerable amount of times slowing down your program.

What you could do, is storing a limited amount of {parameters: result} combinations and skip its logic any time the function is called with the same parameters.

It's a dirty trick but quite effective especially if the parameters combination is low compared to the function speed.

In Python 3 there's a decorator for this purpose.
In Python 2 a library can help but you need a bit more work.

Upvotes: 9

How to cache reads?

Answers (3)

Related Questions