Reputation: 179
I wrote some Python code to read a DNA sequence (to do motif alignment on it later); however, I'm looking for a more efficient way to do this.
See below if you can help:
handle = open("a.fas.txt", "r")
a = handle.readlines()[1:]           # skip the FASTA header line
a = ''.join([x.strip() for x in a])  # join the sequence into one string
with open("Output.txt", "w") as text_file:
    text_file.write(a)
f = 0
z = 100
b = ''
while f < len(a):
    b += a[f:z] + '\n'  # one 100-bp window per line
    f += 1
    z += 1
with open("2.txt", "w") as runner_mtfs:
    runner_mtfs.write(b)
In summary, I want to do a bunch of analysis on each line of b, but I don't know of any more efficient way to do this than splitting the sequence into overlapping 100-base-pair windows. The output file is more than 500 megabytes. Any suggestions?
The first part of the code just reads the DNA sequence and joins all the lines together; the loop then writes out the 100-base-pair windows.
Upvotes: 4
Views: 544
Reputation: 15310
Here's a class that does a few things you might want.
"""
Read in genome of E. Coli (or whatever) from given input file,
process it in segments of 100 basepairs at a time.
Usage: 100pairs [-n <pairs>] [-p] <file>
<file> Input file.
-n,--numpairs <pairs> Use <pairs> per iteration. [default: 100]
-p,--partial Allow partial sequences at end of genome.
"""
import docopt
class GeneBuffer:
def __init__(self, path, bases=100, partial=True):
self._buf = None
self.bases = int(bases)
self.partial = partial
self.path = path
def __enter__(self):
self._file = open(self.path, 'r')
self._header = next(self._file)
return self
def __exit__(self, *args):
if self._file:
self._file.close()
def __iter__(self):
return self
def __next__(self):
if self._buf is None:
self._buf = ''
while self._file and len(self._buf) < self.bases:
try:
self._buf += next(self._file).strip()
except StopIteration:
self._file.close()
self._file = None
break
if len(self._buf) < self.bases:
if len(self._buf) == 0 or not self.partial:
raise StopIteration
result = self._buf[:self.bases]
self._buf = self._buf[1:]
return result
def analyze(basepairs):
"""
Dammit, Jim! I'm a computer programmer, not a geneticist!
"""
print(basepairs)
def main(args):
numpairs = args['--numpairs']
partial = args['--partial']
with GeneBuffer(args['<file>'], bases=numpairs, partial=partial) as genome:
print("Header: ", genome._header)
count = 0
for basepairs in genome:
count += 1
print(count, end=' ')
analyze(basepairs)
if __name__ == '__main__':
args = docopt.docopt(__doc__)
main(args)
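For example, assuming you save this as 100pairs.py (the script name is my own; the input filename is the one from your question), you would invoke it as:

python 100pairs.py -n 100 -p a.fas.txt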
Upvotes: 1
Reputation: 444
First of all, I suggest reading the file line by line using a with open(...) construction (a context manager). It's fast and it consumes less memory. It seems that you want to process data with a fixed-size sliding window. I would do it like this:
def load_fasta(fasta_file_name, sliding_window_size=100):
    buffer = ''
    with open(fasta_file_name) as f:
        for line in f:
            if line.startswith('>'):
                # skip the comment line (or extract some info from it)
                buffer = ''
            else:
                # append the next line of sequence data
                buffer += line.strip('\r\n')
                offset = 0  # zero-based offset of the current window
                while offset + sliding_window_size <= len(buffer):
                    next_sliding_window = buffer[offset : offset + sliding_window_size]
                    yield next_sliding_window
                    offset += 1
                # keep only the tail that can still contribute to a window
                buffer = buffer[offset:]

for window in load_fasta("a.fas.txt", 100):
    # do some processing with the sliding window data
    print(window)
If you actually want to process portions of data shorter than 100 characters (or, in my example, shorter than the sliding window size), you will have to modify that function slightly (at the appearance of a new comment line and at the end of processing).
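A minimal sketch of that modification, assuming the desired behaviour is to emit whatever remains in the buffer as one final short window (the trailing yields are my addition, not part of the function above):

def load_fasta_with_tail(fasta_file_name, sliding_window_size=100):
    buffer = ''
    with open(fasta_file_name) as f:
        for line in f:
            if line.startswith('>'):
                if buffer:
                    yield buffer  # flush the partial window before the next record
                buffer = ''
            else:
                buffer += line.strip('\r\n')
                offset = 0
                while offset + sliding_window_size <= len(buffer):
                    yield buffer[offset : offset + sliding_window_size]
                    offset += 1
                buffer = buffer[offset:]
    if buffer:
        yield buffer  # flush the final partial window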
You can also use Biopython.
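If you go that route, a minimal sketch using Bio.SeqIO to pull the sequence out of the FASTA file (the filename is from the question; the windowing loop is just for illustration):

from Bio import SeqIO

record = next(SeqIO.parse("a.fas.txt", "fasta"))  # first record in the file
sequence = str(record.seq)
for offset in range(len(sequence) - 100 + 1):
    window = sequence[offset : offset + 100]
    # process the window here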
Upvotes: 1
Reputation: 9890
The major problem I see here is that you're writing everything out into a file. There's no point in doing this. The large output file you create is very redundant, and loading it back in when you do your analysis isn't helpful.
After you've loaded the file initially, every window you're interested in looking at is a[x:x+100] for some x. You shouldn't need to actually generate those windows explicitly at all: there shouldn't be any benefit to doing so. Go through, and generate those matrices from each window of a directly.
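That is, something along these lines (analyze is a stand-in for whatever per-window matrix computation you're doing):

for x in range(len(a) - 100 + 1):
    window = a[x:x + 100]  # the 100-bp window starting at position x
    analyze(window)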
If you really do need the whole thing, generate it as a numpy array. Additionally, if I'm not using any degenerate base codes, I convert the sequence to uint8, using 0, 1, 2, 3 for A, C, G, T. This can help speed things up, especially if you need to take complements at any point, which can then be done as simple bit fiddling.
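For example, with that encoding the complement is a single XOR, since A (0) pairs with T (3) and C (1) pairs with G (2); this is a sketch of the idea, not code from the answer:

import numpy

seq = numpy.array([0, 1, 2, 3], dtype=numpy.uint8)  # A, C, G, T
complement = seq ^ 3                                # [3, 2, 1, 0], i.e. T, G, C, A
reverse_complement = complement[::-1]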
Numpy can generate the array quite efficiently using stride_tricks, as noted in this blog post:
import numpy

def rolling_window(a, window):
    # Build a (len(a) - window + 1, window) view onto `a` without copying.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

with open("U00096.2.fas.txt", "r") as handle:
    a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
b = numpy.array(list(a), dtype='S1')  # one-byte strings (numpy.character is deprecated as a dtype)
rolling_window(b, 100)
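Note that as_strided returns a view onto b rather than a copy, so the full window matrix costs essentially no extra memory; just treat it as read-only.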
Or, converting to ints:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

with open("U00096.2.fas.txt", "r") as handle:
    a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
conv = {'a': 0, 'c': 1, 'g': 2, 't': 3}  # assumes a lowercase sequence
b = numpy.array([conv[x] for x in a], dtype=numpy.uint8)
rolling_window(b, 100)
This code is around ten times faster than yours on my machine.
Upvotes: 1