Antaeus

Reputation: 179

How to read a DNA sequence more efficiently?

I wrote some code in Python to read a DNA sequence (to do motif alignment on it later); however, I'm looking for a more efficient way to do this.

See below if you can help:

# read the FASTA file, skipping the header line
handle = open("a.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])  # join the sequence lines into one string
with open("Output.txt", "w") as text_file:
    text_file.write(a)

# write every 100-base window (sliding by one base) to a second file
f = 0
z = 100
b = ''
while f < len(a):
    b += a[f:z]+'\n'
    f += 1
    z += 1
with open("2.txt", "w") as runner_mtfs:
    runner_mtfs.write(b)

In summary, I want to do a bunch of analysis on each line of b, but I don't know of a more efficient way to do this than splitting the sequence into 100-base-pair windows. The output file is more than 500 megabytes. Any suggestions?

The first part of the code just reads the DNA sequence: I join all the lines together, then split the result into 100-base-pair windows.

Upvotes: 4

Views: 544

Answers (3)

aghast

Reputation: 15310

Here's a class that does a few things you might want.

"""
Read in genome of E. Coli (or whatever) from given input file,
process it in segments of 100 basepairs at a time.

Usage: 100pairs [-n <pairs>] [-p] <file>

<file>                 Input file.
-n,--numpairs <pairs>  Use <pairs> per iteration. [default: 100]
-p,--partial           Allow partial sequences at end of genome.
"""
import docopt

class GeneBuffer:
    def __init__(self, path, bases=100, partial=True):
        self._buf = None
        self.bases = int(bases)
        self.partial = partial
        self.path = path

    def __enter__(self):
        self._file = open(self.path, 'r')
        self._header = next(self._file)
        return self

    def __exit__(self, *args):
        if self._file:
            self._file.close()

    def __iter__(self):
        return self

    def __next__(self):
        if self._buf is None:
            self._buf = ''

        # refill the buffer from the file until a full window is available
        while self._file and len(self._buf) < self.bases:
            try:
                self._buf += next(self._file).strip()
            except StopIteration:
                self._file.close()
                self._file = None
                break

        # at end of file, stop unless partial windows were requested
        if len(self._buf) < self.bases:
            if len(self._buf) == 0 or not self.partial:
                raise StopIteration

        result = self._buf[:self.bases]
        self._buf = self._buf[1:]  # advance the sliding window by one base

        return result

def analyze(basepairs):
    """
    Dammit, Jim! I'm a computer programmer, not a geneticist!
    """
    print(basepairs)

def main(args):
    numpairs = args['--numpairs']
    partial = args['--partial']
    with GeneBuffer(args['<file>'], bases=numpairs, partial=partial) as genome:
        print("Header: ", genome._header)
        count = 0
        for basepairs in genome:
            count += 1
            print(count, end=' ')
            analyze(basepairs)

if __name__ == '__main__':
    args = docopt.docopt(__doc__)
    main(args)
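
For quick experiments you can also drive the class directly, without the docopt front end. A minimal sketch, assuming a single-record FASTA-like file named a.fas.txt:

# hypothetical direct use of GeneBuffer, bypassing the CLI
with GeneBuffer("a.fas.txt", bases=100, partial=False) as genome:
    for basepairs in genome:
        analyze(basepairs)  # each item is a 100-character window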

Upvotes: 1

merletta

Reputation: 444

  1. If it is a .fasta-like file, there is a good chance that it contains more than one sequence.
  2. There are a lot of examples of reading large files in Python on Stack Overflow; some useful approaches are given here. I usually use the recipe from the top answer to that question (the with open(...) idiom). It's fast and it consumes less memory.
  3. It seems that you want to process the data with a fixed-size sliding window. I would do it like this:

    def load_fasta(fasta_file_name, sliding_window_size=100):
        buffer = ''
        with open(fasta_file_name) as f:
            for line in f:
                if line.startswith('>'):
                    # skip the comment line (or pull metadata from it) and reset
                    buffer = ''
                else:
                    # append the next sequence line and emit all full windows
                    buffer += line.strip('\r\n')
                    offset = 0  # zero-based offset into the current buffer
                    while offset + sliding_window_size <= len(buffer):
                        yield buffer[offset : offset + sliding_window_size]
                        offset += 1
                    buffer = buffer[offset:]

    for window in load_fasta("a.fas.txt", 100):
        # do some processing with the sliding-window data
        print(window)


If you actually want to process chunks shorter than 100 bases (in my example, shorter than the sliding-window size), you will have to modify that function slightly: flush the buffer both when a new comment line appears and at the end of processing; see the sketch below.
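
A minimal sketch of that modification, assuming the leftover buffer should be yielded once as a single short chunk (the name load_fasta_with_partials is mine):

    def load_fasta_with_partials(fasta_file_name, sliding_window_size=100):
        buffer = ''
        with open(fasta_file_name) as f:
            for line in f:
                if line.startswith('>'):
                    if buffer:  # flush the leftover of the previous record
                        yield buffer
                    buffer = ''
                else:
                    buffer += line.strip('\r\n')
                    offset = 0
                    while offset + sliding_window_size <= len(buffer):
                        yield buffer[offset : offset + sliding_window_size]
                        offset += 1
                    buffer = buffer[offset:]
        if buffer:  # flush whatever remains at end of file
            yield buffer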

You can also use Biopython.
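
For example, a minimal sketch using Bio.SeqIO, which iterates every record in the file, so multi-sequence FASTA files are handled too:

    from Bio import SeqIO

    for record in SeqIO.parse("a.fas.txt", "fasta"):
        seq = str(record.seq)
        for offset in range(len(seq) - 100 + 1):
            window = seq[offset : offset + 100]
            print(window)  # do some processing with the sliding-window data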

Upvotes: 1

cge

Reputation: 9890

The major problem I see here is that you're writing everything out into a file. There's no point in doing this. The large output file you create is very redundant, and loading it back in when you do your analysis isn't helpful.

After you've loaded the file initially, every window you're interested in looking at is a[x:x+100] for some x. You shouldn't need to generate those windows explicitly at all; there's no benefit to doing so. Just go through and generate your matrices from each window of a directly.
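
In loop form that's simply the following sketch, where analyze is a placeholder for whatever per-window work you do:

def analyze(window):
    print(window)  # placeholder for the real per-window analysis

width = 100
for x in range(len(a) - width + 1):
    analyze(a[x : x + width])  # each window is a[x:x+100], never written to disk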

If you really do need the whole thing, generate it as a numpy array. Additionally, if I'm not using any degenerate base codes, I convert the sequence to uint8 values, using 0, 1, 2, 3 for A, C, G, T. This can help speed things up, especially if you need to take complements at any point, which then becomes simple bit fiddling.
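
With that 0/1/2/3 encoding, complementing is a single XOR with 3, since A<->T is 0<->3 and C<->G is 1<->2. A minimal sketch:

import numpy

seq = numpy.array([0, 1, 2, 3, 0], dtype=numpy.uint8)  # encodes "acgta"
complement = seq ^ 3  # A<->T and C<->G in one vectorized XOR
reverse_complement = complement[::-1]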

Numpy can generate the array quite efficiently using stride_tricks, as noted in this blog post:

import numpy

def rolling_window(a, window):
    # every length-`window` slice of `a`, as a strided view (no copying)
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

handle = open("U00096.2.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
b = numpy.array(list(a), dtype='S1')  # one byte per base
rolling_window(b, 100)

Or, converting to ints:

import numpy

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

handle = open("U00096.2.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
conv = {'a': 0, 'c': 1, 'g': 2, 't': 3}  # assumes lowercase bases in the file
b = numpy.array([conv[x] for x in a], dtype=numpy.uint8)
rolling_window(b, 100)

This code is around ten times faster than yours on my machine.
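
Note that as_strided returns a view into b rather than a copy, so it's essentially free to create, but you shouldn't write through it. As a usage sketch (the gc_content name is mine), per-window GC content over all windows is then one vectorized expression:

windows = rolling_window(b, 100)  # shape (len(b) - 99, 100), a view into b
gc_content = ((windows == conv['c']) | (windows == conv['g'])).sum(axis=1) / 100.0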

Upvotes: 1
