Reputation: 25
I'm downloading mtDNA records off NCBI and trying to extract lines from them using Python. The lines I'm trying to extract either start with or contain certain keywords such as 'haplotype' and 'nationality' or 'locality'. I've tried the following code:
import re
infile = open('sequence.txt', 'r')   # open the input file for reading
outfile = open('results.txt', 'a')   # open the output file for appending
for line in infile:
    if re.findall("(.*)haplogroup(.*)", line):
        outfile.write(line)
        outfile.write(infile.readline())
infile.close()
outfile.close()
The output here only contains the first line containing 'haplogroup' and for example not the following line from the infile:
/haplogroup="T2b20"
I've also tried the following:
keep_phrases = ["ACCESSION", "haplogroup"]
for line in infile:
    for phrase in keep_phrases:
        if phrase in line:
            outfile.write(line)
            outfile.write(infile.readline())
But this doesn't give me all of the lines containing ACCESSION and haplogroup.
line.startswith works, but I can't use it for lines where the word is in the middle of the line.
Could anyone give me an example piece of code that writes the following line to my output because it contains 'locality':
/note="origin_locality:Wales"
Any other advice for how I can extract lines containing certain words is also appreciated.
Edit:
/haplogroup="L2a1l2a"
/note="ethnicity:Ashkenazic Jewish;
origin_locality:Poland: Warsaw; origin_coordinates:52.21 N
21.05 E"
/note="TAA stop codon is completed by the addition of 3' A
residues to the mRNA"
/note="codons recognized: UCN"
In this case, using Peter's code, the first three lines are written to the outfile but not the line containing 21.05 E". How can I make an exception for /note=" and copy all of the lines until the second set of quotation marks, without copying the /note lines containing /note="TAA or /note="codons?
edit2:
This is my current solution which is working for me.
stuff_to_write = []
multiline = False
with open('sequences.txt') as f:
    for line in f.readlines():
        if any(phrase in line for phrase in keep_phrases) or multiline:
            do_not_write = False
            if multiline and line.count('"') >= 1:
                multiline = False
            if 'note' in line:
                if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                    do_not_write = True
                elif line.count('"') < 2:
                    multiline = True
            if not do_not_write:
                stuff_to_write.append(line)
Upvotes: 2
Views: 2391
Reputation: 3495
This will search a file for matching phrases and write those lines to a new file, provided that nothing after "note" matches anything in remove_phrases.
It reads the input line by line to check whether a line contains any of the words in keep_phrases, stores the matching lines in a list, then writes them to a new file on separate lines. Unless you need to write the new file line by line as the matches are found, it should be faster this way since everything is written at the same time.
If you don't want it to be case sensitive, change the any(phrase in line to any(phrase.lower() in line.lower().
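As a minimal sketch of that case-insensitive check, the test can be pulled out into a small helper. The name line_matches is hypothetical and only used for illustration; it lowercases the line once instead of once per phrase.

```python
def line_matches(line, phrases):
    """Return True if any phrase occurs in line, ignoring case."""
    lowered = line.lower()  # lowercase the line a single time
    return any(phrase.lower() in lowered for phrase in phrases)


# Sample line taken from the question
sample = '/note="origin_locality:Wales"\n'
print(line_matches(sample, ["ACCESSION", "Haplogroup", "LOCALITY"]))  # True
```

The helper could then replace the any(...) expression in the loop.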
keep_phrases = ["ACCESSION", "haplogroup", "locality"]
remove_phrases = ['codon', 'TAA']
stuff_to_write = []
with open('C:/a.txt') as f:
    for line in f:
        # Keep the line if it contains any of the wanted phrases
        if any(phrase in line for phrase in keep_phrases):
            do_not_write = False
            # Skip notes whose text after 'note' contains an unwanted phrase
            if 'note' in line:
                if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                    do_not_write = True
            if not do_not_write:
                stuff_to_write.append(line)
# Each stored line already ends with its newline, so write them as they are
with open('C:/b.txt', 'w') as f:
    f.writelines(stuff_to_write)
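For the multi-line /note values from the question's edit, one possible approach is to collect lines into a block until the quotation marks are balanced and then test the block as a whole. This is only a sketch: the function name extract_records is hypothetical, and it assumes that a quoted value is opened and closed by lines holding an odd number of double quotes, which holds for the sample in the question but may not hold for every record.

```python
def extract_records(lines, keep_phrases, remove_phrases):
    """Return kept lines; quoted values spanning several lines stay together."""
    kept = []
    block = []         # lines of the value currently being collected
    in_quotes = False  # True while a quoted value is still open
    for line in lines:
        block.append(line)
        # Assumption: an odd number of quotes opens or closes a quoted value
        if line.count('"') % 2 == 1:
            in_quotes = not in_quotes
        if in_quotes:
            continue  # keep collecting until the closing quote
        text = "".join(block)
        wanted = any(p in text for p in keep_phrases)
        # Same rule as above: only text after 'note' is checked for removal
        unwanted = "note" in text and any(
            p in text.split("note", 1)[1] for p in remove_phrases
        )
        if wanted and not unwanted:
            kept.extend(block)
        block = []
    return kept


# Sample lines from the question's edit
sample = [
    '/haplogroup="L2a1l2a"\n',
    '/note="ethnicity:Ashkenazic Jewish;\n',
    'origin_locality:Poland: Warsaw; origin_coordinates:52.21 N\n',
    '21.05 E"\n',
    '/note="TAA stop codon is completed by the addition of 3\' A\n',
    'residues to the mRNA"\n',
    '/note="codons recognized: UCN"\n',
]
result = extract_records(
    sample, ["ACCESSION", "haplogroup", "locality"], ["codon", "TAA"]
)
print("".join(result), end="")
```

With the sample above, the haplogroup line and the three lines of the ethnicity note are kept, while both codon-related notes are dropped. To use it on a file, pass the open file object as lines and write the result with writelines.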
Upvotes: 1