Reputation: 3
** New to Python, sorry **
I'm trying to take a given example file and add only the lines containing "A" or "T" or "G" or "C" (DNA strands) to a list, using a function.
Example file:
gene1
ATGATGATGGCG
gene2
GGCATATC
CGGATACC
gene3
TAGCTAGCCCGC
Under gene2 there are two separate lines I need to concatenate using my function.
Here's what I have completed for my function:
def create(filename):
    """
    Purpose: Creates and returns a data structure (list) to store data.
    :param filename: The given file
    Post-conditions: (none)
    :return: List of data.
    """
    new_list = []
    f = open(filename, 'r')
    for i in f:
        if not('A' or 'T' or 'G' or 'C') in i:
            new_list = new_list #Added this so nothing happens but loop cont.
        else:
            new_list.append(i.strip())
    f.close()
    return new_list
I need to somehow find parts of the file where there are two consecutive lines of DNA ("GTCA") and join them before adding them to my list.
If done correctly the output when printed should read:
['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']
Thanks in advance!
Upvotes: 0
Views: 543
Reputation: 322
Regexes to the rescue!
import re

def create(filename):
    dna_regex = re.compile(r'[ATGC]+')
    with open(filename, 'r') as f:
        # Drop newlines so a sequence split across lines is matched as one run
        return dna_regex.findall(f.read().replace('\n', ''))

new_list = []
new_list += create("gene_file.txt")
Note that this implementation can produce false positives if a gene-name line contains an uppercase A, T, G, or C, because the newlines are removed before matching.
It reads the whole file, removes the newlines, then finds every run of characters consisting only of A, T, G, or C and returns those runs as a list.
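If gene names can contain uppercase A, T, G, or C, one way around the false positives is to match whole lines instead of stripping the newlines first. The following is only a sketch under some assumptions: the names `DNA_BLOCK` and `extract_dna` and the header `GeneA` are made up for illustration, lines end in a plain `\n`, and DNA lines carry no trailing whitespace.

```python
import re

# One or more consecutive lines that consist solely of A, T, G, C.
# With re.MULTILINE, ^ and $ anchor at the start and end of each line.
DNA_BLOCK = re.compile(r'^[ATGC]+(?:\n[ATGC]+)*$', re.MULTILINE)

def extract_dna(text):
    """Return each run of consecutive DNA-only lines joined into one string."""
    return [block.replace('\n', '') for block in DNA_BLOCK.findall(text)]

# 'GeneA' is a hypothetical header containing uppercase G and A.
sample = "GeneA\nATGATGATGGCG\ngene2\nGGCATATC\nCGGATACC\ngene3\nTAGCTAGCCCGC\n"
print(extract_dna(sample))
```

Because a header such as `GeneA` is not made up entirely of the four letters, it never matches as a line, so its `G` and `A` are not glued onto the following sequence.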
Upvotes: 1
Reputation: 23770
If we can assume that each DNA section is preceded by exactly one header line, we can take advantage of the takewhile function to group the DNA lines:
from itertools import takewhile

DNA_CHARS = ('A', 'T', 'G', 'C')
lines = ['gene1', 'ATGATGATGGCG', 'gene2', 'GGCATATC', 'CGGATACC', 'gene3', 'TAGCTAGCCCGC']
# Skip the first header; takewhile consumes each later header when it stops
input_lines = iter(lines[1:])
dna_lines = []
while True:
    dna_line = ''.join(takewhile(lambda l: any(dna_char in l for dna_char in DNA_CHARS),
                                 input_lines))
    if not dna_line:
        break
    dna_lines.append(dna_line)
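If the one-header-line assumption does not hold (for example, two non-DNA lines in a row), `itertools.groupby` is a possible alternative to takewhile, since it groups consecutive lines by whether they look like DNA. This is a sketch of that different technique, not the approach above; `is_dna` and `join_dna_runs` are hypothetical helper names.

```python
from itertools import groupby

DNA_CHARS = set('ATGC')

def is_dna(line):
    """True for a non-empty line made up only of A, T, G, C."""
    return bool(line) and set(line) <= DNA_CHARS

def join_dna_runs(lines):
    """Join each run of consecutive DNA lines into a single string."""
    stripped = (line.strip() for line in lines)
    # groupby yields (key, group) pairs for consecutive items sharing a key;
    # only the groups whose key is True (DNA lines) are kept.
    return [''.join(group) for dna, group in groupby(stripped, key=is_dna) if dna]

lines = ['gene1', 'ATGATGATGGCG', 'gene2', 'GGCATATC', 'CGGATACC', 'gene3', 'TAGCTAGCCCGC']
print(join_dna_runs(lines))
```

Unlike the takewhile loop, this does not stop at the first empty group, so any number of header lines between DNA sections is tolerated.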
Upvotes: 0
Reputation: 10860
You can use sets to check if a line is a DNA line, i.e. consists of the letters ACGT only:
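with open(filename) as f:
    new_list = []
    concat = False  # True while the previous line was a DNA line
    for line in f:
        seq = line.strip()
        # Subset test: non-empty and made up of A, C, G, T only,
        # so a line that lacks one of the letters (e.g. 'AAAA') still counts
        if seq and set(seq) <= {'A', 'C', 'G', 'T'}:
            if concat:
                new_list[-1] += seq
            else:
                new_list.append(seq)
            concat = True
        else:
            concat = False
# ['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']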
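The choice of set comparison matters for short or repetitive sequences. An equality test against {'A', 'C', 'G', 'T'} accepts only lines that use all four letters, while a subset test accepts any non-empty line that uses nothing else. A small sketch of the difference, with `is_dna_equal` and `is_dna_subset` as illustrative names:

```python
DNA_LETTERS = {'A', 'C', 'G', 'T'}

def is_dna_equal(line):
    """Equality test: true only if the line uses all four letters and nothing else."""
    return set(line.strip()) == DNA_LETTERS

def is_dna_subset(line):
    """Subset test: true if the line is non-empty and uses only A, C, G, T."""
    seq = line.strip()
    return bool(seq) and set(seq) <= DNA_LETTERS

# Compare both tests on a full sequence, a one-letter sequence,
# a header line, and a blank line.
for line in ['GGCATATC\n', 'AAAA\n', 'gene2\n', '\n']:
    print(repr(line), is_dna_equal(line), is_dna_subset(line))
```

Both tests agree on the example file, but only the subset test treats a line like AAAA as DNA. The explicit non-empty check is needed because the empty set is a subset of every set.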
Upvotes: 1