Reputation: 23

Extracting sequences in Python

I have a file that looks like this:

>sequence_name_16hj51
CAACCTTGGCCAT
>sequence_name_158ghni52
AATTGGCCTTGGA
>sequence_name_468rth
AAGGTTCCA

I would like to obtain this: ['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']

I have a list with all the sequence names titled title_finder. When I try to use:

for i in range(0,len(title_finder)):
    seq = seq.split(title_finder[i])
    print seq

I get this traceback:

Traceback (most recent call last):
  File "D:/Desktop/Python/consensus new.py", line 23, in <module>
    seq = seq.split(title_finder[i])
AttributeError: 'list' object has no attribute 'split'

Can somebody help me out?

EDIT: Sometimes some sequences span multiple lines and so I get more than one string when I do it with a for loop.

Upvotes: 1

Answers (4)

LetzerWille

Reputation: 5658

import re

with open('test') as f:
    # Keep only the lines that do not contain a sequence name
    lines = [line.rstrip() for line in f if not re.search('sequence_name', line)]

print(lines)

['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']

Upvotes: 0

Kasravnd

Reputation: 107287

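You are trying to call split on a list, which raises that AttributeError. Note that str.split itself returns a list, so even if seq starts out as a string, it is a list from the second iteration on. Instead, you can read your file line by line and keep each line that doesn't start with >.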

with open('file_name') as f:
    my_patterns = [line.rstrip() for line in f if not line.startswith('>')]

As a more Pythonic alternative, if you are sure that the patterns are on the odd-indexed lines (counting from 0, i.e. every second line of the file), you can use itertools.islice to slice your file object:

from itertools import islice
with open('file_name') as f:
    # Each item keeps its trailing newline; strip it if needed
    my_patterns = list(islice(f, 1, None, 2))

Note that if you just want to loop over your patterns, you don't need to convert the result of islice to a list; you can simply iterate over the iterator.
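
As a minimal sketch of that last point, the loop below consumes the islice iterator directly instead of building a list first. The io.StringIO object is only a stand-in for the open file so the example is self-contained, and the lengths variable is a hypothetical name used for illustration. It assumes Python 3 and that header and sequence lines strictly alternate.

```python
import io
from itertools import islice

# Stand-in for the open file (assumption: headers and sequences alternate).
fasta = io.StringIO(
    ">sequence_name_16hj51\n"
    "CAACCTTGGCCAT\n"
    ">sequence_name_158ghni52\n"
    "AATTGGCCTTGGA\n"
    ">sequence_name_468rth\n"
    "AAGGTTCCA\n"
)

lengths = []
# islice yields every second line lazily; no intermediate list is created.
for line in islice(fasta, 1, None, 2):
    lengths.append(len(line.rstrip()))

print(lengths)
```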

Upvotes: 1

BioGeek

Reputation: 22827

If you're doing bioinformatics, you should really consider installing BioPython.

from Bio import SeqIO

with open('your_file.fasta') as f:
    # SeqIO.parse yields one record per sequence, even if it spans several lines
    sequences = [str(record.seq) for record in SeqIO.parse(f, "fasta")]

If you want to do it in pure Python, then this will work:

with open('your_file.fasta') as f:
    print([line.rstrip() for line in f if not line.startswith('>')])
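
For the multi-line case mentioned in the question's edit, a filter that only drops header lines returns one string per line rather than one per sequence. The following is a rough sketch of one way to join the pieces in pure Python, not a full FASTA parser. The function name read_sequences is hypothetical, io.StringIO stands in for the file, and the sketch assumes Python 3 and that any text before the first > header can be ignored.

```python
import io

def read_sequences(handle):
    """Return one string per record, joining sequence lines that wrap."""
    sequences = []
    chunks = None  # pieces of the current record; None until the first header
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if chunks is not None:
                sequences.append(''.join(chunks))
            chunks = []
        elif line and chunks is not None:
            chunks.append(line)
    # Flush the last record once the file ends.
    if chunks is not None:
        sequences.append(''.join(chunks))
    return sequences

# Stand-in for a file whose first sequence is wrapped over two lines.
example = io.StringIO(
    ">sequence_name_16hj51\n"
    "CAACCTT\n"
    "GGCCAT\n"
    ">sequence_name_468rth\n"
    "AAGGTTCCA\n"
)
print(read_sequences(example))  # ['CAACCTTGGCCAT', 'AAGGTTCCA']
```

With a real file, the same function could be called as read_sequences(f) inside the with block above.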

Upvotes: 4

Iman Mirzadeh

Reputation: 13550

Assuming your file is seq.in, you can do this to get your list:

In [17]: with open('seq.in', 'r') as f:
    ...:     # rstrip('\n') is safe even if the last line has no trailing newline
    ...:     extracted_list = [line.rstrip('\n') for line in f if line[0] != '>']

In [18]: extracted_list
Out[18]: ['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']

Upvotes: 0
