derpaminontas_1992

Reputation: 37

Reading FASTA files and editing the lines in Python

I'm new to programming and this is the first time I've used Python. I'm working on code that should read a FASTA file and delete the header of each sequence. My code to read the file:

def read_fasta(inputfile):
    # the with block closes the file automatically, so no explicit close is needed
    with open(inputfile, 'r') as f:
        file = f.readlines()
    return file

fasta_file=read_fasta('SELEX_100_reads.txt')

print(fasta_file)

The output looks like this:

['@DBV2SVN1:110:B:7:1101:1456:2092\n', 'CTAAAAAGCGAGTGCGNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNCNNNNNNNNAAACANNAAGGTAAGAAACAAGCACAGATGAGAGC\n', '\n', '+\n', '#####################################################################################################\n', '\n', '@DBV2SVN1:110:B:7:1101:2491:2141\n', 'AAGTGAGCAAACAGAAACATAGTGCGGAGTGGGAAAATGAGACTCAAAAAAAGAGTGTGGGTATTCAGTAGGGGATATTAGGCCACAATACGAAAGAGCAA\n', '\n', '+\n', '#####################################################################################################\n', '\n', '@DBV2SVN1:110:B:7:1101:2924:2130\n'......]

It's a list that includes a header for each sequence, but I just want the DNA sequences (CTAAAA... or AAGTGAGCAA...) as a list. Can anyone help me with that? Thanks a lot.

Cheers, John

Upvotes: 0

Views: 1284

Answers (3)

Daniser

Reputation: 162

You can filter the DNA into a new list:

only_dna = fasta_file[1::6]

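In [1::6], the 1 is the starting index and the 6 is the step, so the slice takes every sixth line starting from the second one. This relies on each record occupying exactly six lines, as in the output shown in the question.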
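As a minimal sketch of the slice in action, the example below uses a small list modelled on the question's output (the sequences are shortened here for readability, and the trailing newlines are stripped as an extra step that the original answer does not include):

```python
# Sample lines modelled on the question's output (sequences shortened).
fasta_file = [
    '@DBV2SVN1:110:B:7:1101:1456:2092\n',
    'CTAAAAAGCGAGTGCGNCNN\n',
    '\n',
    '+\n',
    '####################\n',
    '\n',
    '@DBV2SVN1:110:B:7:1101:2491:2141\n',
    'AAGTGAGCAAACAGAAACAT\n',
    '\n',
    '+\n',
    '####################\n',
    '\n',
]

# Start at index 1 (the first sequence line), take every 6th line after it,
# and strip the trailing newline from each one.
only_dna = [line.strip() for line in fasta_file[1::6]]
print(only_dna)  # ['CTAAAAAGCGAGTGCGNCNN', 'AAGTGAGCAAACAGAAACAT']
```

If the file ever contains a record with a different number of lines, the slice will silently drift onto the wrong lines, so it is worth checking a few results by eye.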

Upvotes: 0

alani

Reputation: 13079

You can use a regex filter. Assuming that you just want lines that contain only one or more A/C/G/T or N characters (aside from newline and any other trailing whitespace), you could do:

import re

fasta_file = list(filter(re.compile(r"[ACGTN]+\s*$").match, fasta_file))

to remove the other lines.

If strings containing N are not meant to be included (I don't know enough biochemistry to know what they represent - not a nucleotide by the looks of things), then obviously exclude the N from the regexp.
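A variant of the regex filter above, as a rough sketch: it strips each line first and uses `fullmatch`, so the pattern no longer has to account for trailing whitespace. The names `DNA_LINE` and `keep_dna` are made up for this illustration, and the sample list is shortened from the question's output.

```python
import re

# Hypothetical helper names; the pattern accepts only A, C, G, T and N.
DNA_LINE = re.compile(r"[ACGTN]+")

def keep_dna(lines):
    """Return the stripped lines that consist solely of A/C/G/T/N."""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if DNA_LINE.fullmatch(s)]

sample = [
    '@DBV2SVN1:110:B:7:1101:1456:2092\n',
    'CTAAAAAGCGAGTGCGNCNN\n',
    '\n',
    '+\n',
    '####################\n',
    '\n',
]
print(keep_dna(sample))  # ['CTAAAAAGCGAGTGCGNCNN']
```

One caveat with any content-based filter: a quality line that happened to contain only the letters A, C, G, T or N would also pass, so this is a heuristic rather than a real parser.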

Upvotes: 1

Poojan

Reputation: 3519

  • From the question, I think what you want is all the lines that are DNA sequences.
  • You can filter out lines that contain anything other than A, C, G, T.

def read_fasta(inputfile):
    with open(inputfile, 'r') as f:
        file = f.readlines()
    allowed = {'A', 'G', 'T', 'C'}  # add 'N' to also keep reads that contain N
    ret = []
    for line in file:
        seq = line.strip()
        # keep non-empty lines made up only of allowed characters
        if seq and set(seq) <= allowed:
            ret.append(seq)
    return ret

fasta_file=read_fasta('SELEX_100_reads.txt')

print(fasta_file)

Upvotes: 0
