Reputation: 165
I am fairly new to Python and I have an issue with my for
loop that I can't quite seem to figure out.
I am trying to read into a FASTA file which has the following example text:
>seq1
AAACTACCGCGTTT
>seq2
AAACTGCAACTAGCGTTT
>seq3
AAACCGGAGTTACCTAGCGTTT
What I would like to do is read into my file and print the FASTA header (e.g. header is >seq1), then I want to match two unique patterns (in this e.g. "AAA" and "TTT") present in the DNA sequence and print the DNA sequence that is between these two patterns.
So my will like my output to look like this:
>seq1
CTACCGCG
>seq2
CTGCAACTAGCG
>seq3
CCGGAGTTACCTAGCG
I have the following code:
import re
def find_seq(filename):
with open(filename) as file:
seq=''
for line in file:
header = re.search(r'^>\w+', line)
if(header):
print (header.group())
seq = seq.replace('\n','')
find_Lpattern = re.sub(r'.*AAA', '',seq)
find_Rpattern = re.sub(r'TTT.*', '',find_Lpattern)
if(find_Rpattern):
print (find_Rpattern)
seq = ''
else:
seq += line
filename = 'test.txt'
print(find_seq(filename))
I keep getting this as my output:
>seq1
>seq2
CTACCGCG
>seq3
CTGCAACTAGCG
Essentially my for loop skips over seq1 and then assigns the DNA sequence from seq1 to seq2, and the iteration on my for loop is off. Could anyone please point me in the right direction so I can fix this issue?
Upvotes: 0
Views: 867
Reputation: 37319
Even assuming your indentation is set in the way that would produce the results you describe, your logic is off. You're printing the header before you handle the accumulated seq
.
When you read line 1 of your file, your header
regexp matches. At that point, seq
is the empty string. It therefore prints the match, and runs your replace and re.sub
calls on the empty string.
Then it reads line 2, "AAACTACCGCGTTT", and appends that to seq
.
Then it reads line 3, ">seq2". That matches your header regexp, so it prints the header. Then in runs your replace and sub calls on seq
- which is still "AAACTACCGCGTTT" from line 2.
You need to move your seq
handling to before you print the headers, and consider what will happen when you run off the end of the file without finding a final header - you will still have 'seq' contents that you want to parse and print after your for loop has ended.
Or maybe look into the third-party biopattern library, which has the SeqIO
module to parse FASTA files.
Upvotes: 2