Reputation: 2386
I am using python/pysam to do analyze sequencing data. In its tutorial (pysam - An interface for reading and writing SAM files) for the command mate it says:
'This method is too slow for high-throughput processing. If a read needs to be processed with its mate, work from a read name sorted file or, better, cache reads.'
How would you 'cache reads'?
Upvotes: 15
Views: 1051
Reputation: 554
I find the other answers do not address how to actually cache reads in practice.
Here is a simple way to do this:
from collections import defaultdict
from pysam import AlignmentFile
def get_mate(read_pairs, read):
if read.qname not in read_pairs or not (read.is_read1 ^ read.is_read2):
return None
pos = 1 if read.is_read1 else 0
return read_pairs[read.qname][pos]
# maps QNAME to a read pair
read_pairs = defaultdict(lambda : [None, None])
fin = AlignmentFile("your_filepath")
for read in fin.fetch(your_chrom,your_start,your_stop):
if read.is_paired and (read.is_read1 ^ read.is_read2):
pos = 0 if read.is_read1 else 1
read_pairs[read.qname][pos] = read
## Now compare execution time of these two commands
your_read_mate = fin.mate(your_read) # pysam, non-cached
your_read_mate = get_mate(read_pairs, your_read) # cached
In which the operational definition for a read pair is that (c.f. SAM format):
read.is_paired
)read.is_read1
) or 0x80 (read.is_read2
) set (the XOR read.is_read1 ^ read.is_read2
checks for this)On my machine, using ipython's %timeit
command, I get 18.9 ms ± 510 µs
for the non-cached call and 854 ns ± 28.7 ns
for the cached call for a given read (for which I know the pair is in read_pairs
) :-)
Upvotes: 0
Reputation: 418
AlignmentFile takes as the first argument:
filepath_or_object
So instead of supplying a filename, you can supply an object that supports a file-like interface, i.e. the methods seek
, read
, tell
.
When implementing a class for this, you can also implement caching on the reads, which of course have to depend on current cursor position.
If filesize is small enough so that it fits into memory, you can read the complete file and operate on an io.BytesIO
object, no need to make your own class:
data = io.BytesIO(open('datafile','rb').read())
your_object = AlignmentFile(data, <other args>)
I am not sure that this will speed things up much, because i assume that modern operating systems (i know linux will do that) do cache file access. So maybe it is enough to rely on that.
Upvotes: 2
Reputation: 15060
Caching is a typical approach to speed up long running operations. It sacrifices memory for the sake of computational speed.
Let's suppose you have a function which given a set of parameters always returns the same result. Unfortunately this function is very slow and you need to call it a considerable amount of times slowing down your program.
What you could do, is storing a limited amount of {parameters: result} combinations and skip its logic any time the function is called with the same parameters.
It's a dirty trick but quite effective especially if the parameters combination is low compared to the function speed.
In Python 3 there's a decorator for this purpose.
In Python 2 a library can help but you need a bit more work.
Upvotes: 9