Reputation: 37
I'm new to programming and this is the first time I've used Python. I'm working on code that should read a FASTA file and remove the header of each sequence. My code to read the file:
def read_fasta(inputfile):
    # the with block closes the file automatically, so no f.close() is needed
    with open(inputfile, 'r') as f:
        lines = f.readlines()
    return lines

fasta_file = read_fasta('SELEX_100_reads.txt')
print(fasta_file)
Printing fasta_file gives output like this:
['@DBV2SVN1:110:B:7:1101:1456:2092\n', 'CTAAAAAGCGAGTGCGNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNCNNNNNNNNAAACANNAAGGTAAGAAACAAGCACAGATGAGAGC\n', '\n', '+\n', '#####################################################################################################\n', '\n', '@DBV2SVN1:110:B:7:1101:2491:2141\n', 'AAGTGAGCAAACAGAAACATAGTGCGGAGTGGGAAAATGAGACTCAAAAAAAGAGTGTGGGTATTCAGTAGGGGATATTAGGCCACAATACGAAAGAGCAA\n', '\n', '+\n', '#####################################################################################################\n', '\n', '@DBV2SVN1:110:B:7:1101:2924:2130\n'......]
It's a list that includes a header for each sequence, but I just want the DNA sequences (e.g. CTAAAAAGCG... or AAGTGAGCAA...) as a list. Can anyone help me with that? Thanks a lot.
Cheers, John
Upvotes: 0
Views: 1284
Reputation: 162
You can filter the DNA into a new list:
only_dna = fasta_file[1::6]
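In [1::6], the 1 is the starting index and the 6 is the step, so the slice takes every sixth element beginning with the second one. This works because each record in your list spans exactly six lines (header, sequence, blank, +, quality, blank); note that each sequence still ends with '\n'.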
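To make the slice concrete, here is a small sketch on a shortened stand-in for the list in the question. The read names and sequences are made up for illustration, and it assumes every record occupies exactly six lines as in the output shown; if the file's layout differs, the step of 6 will pick the wrong lines.

```python
# Shortened stand-in for the list returned by readlines();
# the read names and sequences here are illustrative, not from the real file.
fasta_file = [
    '@read1\n', 'CTAAAAAGCGAGTGCGNC\n', '\n', '+\n', '##################\n', '\n',
    '@read2\n', 'AAGTGAGCAAACAGAAAC\n', '\n', '+\n', '##################\n', '\n',
]

# Start at index 1, take every 6th element, and strip the trailing newline.
only_dna = [line.strip() for line in fasta_file[1::6]]
print(only_dna)
```

This should print the two sequences without their newlines.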
Upvotes: 0
Reputation: 13079
You can use a regex filter. Assuming that you just want lines that contain only one or more A/C/G/T or N characters (aside from newline and any other trailing whitespace), you could do:
import re
file = list(filter(re.compile(r"[ACGTN]+\s*$").match, file))
to remove the other lines.
If strings containing N are not meant to be included (I don't know enough biochemistry to know what they represent - not a nucleotide by the looks of things), then obviously exclude the N from the regexp.
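As a rough illustration of how the pattern behaves, the sketch below runs it over one made-up record laid out like those in the question. The names dna_line, lines and sequences are only for this example, and the strip() call is an addition to drop the trailing newline.

```python
import re

# Lines consisting only of A/C/G/T/N, optionally followed by whitespace.
dna_line = re.compile(r"[ACGTN]+\s*$")

# One illustrative record in the same layout as the question's output.
lines = [
    '@read1\n', 'CTAAAAAGCGAGTGCGNC\n', '\n', '+\n', '##################\n', '\n',
]

# match() anchors at the start of the string and $ anchors at the end,
# so a line passes only if it is made up entirely of the allowed letters.
sequences = [line.strip() for line in lines if dna_line.match(line)]
print(sequences)
```

The header, the blank lines, the + line and the quality line all fail the match, so only the sequence line is kept.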
Upvotes: 1
Reputation: 3519
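def read_fasta(inputfile):
    with open(inputfile, 'r') as f:
        lines = f.readlines()
    ret = []
    for line in lines:
        seq = line.strip()
        # keep non-empty lines made up only of base letters (N included,
        # since the sequences in the question contain it)
        if seq and set(seq) <= set('ACGTN'):
            ret.append(seq)
    return ret

fasta_file = read_fasta('SELEX_100_reads.txt')
print(fasta_file)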
Upvotes: 0