Abraham Ahmad

Reputation: 31

Directly calling SeqIO.parse() in for loop works, but using it separately beforehand doesn't? Why?

In Python, this code, where I call SeqIO.parse() directly in the for loop, runs fine:

from Bio import SeqIO
a = SeqIO.parse("a.fasta", "fasta")
records = list(a)

for asq in SeqIO.parse("a.fasta", "fasta"):
    print("Q")

But this version, where I first store the output of SeqIO.parse() in a variable(?) called a and then try to use it in my loop, doesn't print anything:

from Bio import SeqIO
a = SeqIO.parse("a.fasta", "fasta")
records = list(a)

for asq in a:
    print("Q")

Is this because the output of SeqIO.parse("a.fasta", "fasta") is stored in 'a' differently from when I call the function directly? What exactly is 'a' here? Is it a variable? Is it an object? What does the function actually return?

Upvotes: 3

Views: 1081

Answers (2)

Susheel Busi

Reputation: 163

I have a similar issue: the parsed sequence file doesn't work inside a for loop. Code below:

import pandas as pd
from Bio import SeqIO

genomes_l = pd.read_csv('test_data.tsv', sep='\t', header=None, names=['anonymous_gsa_id', 'genome_id'])
# sample_f = SeqIO.parse('SAMPLE.fasta', 'fasta')

for i, r in genomes_l.iterrows():
    genome_name = r['anonymous_gsa_id']
    genome_ids = r['genome_id'].split(',')
    genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in genome_ids]
    with open(f'out_dir/{genome_name}_contigs.fasta', 'w') as handle:
        SeqIO.write(genome_contigs, handle, 'fasta')

Originally, I read the file in once as sample_f (the commented-out line), but it wouldn't work inside the loop. I would appreciate any help to avoid reading the file over and over again, specifically in the line below:

genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in genome_ids]

Thank you!

Upvotes: 0

Chris_Rands

Reputation: 41168

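SeqIO.parse() returns a normal Python generator (newer Biopython releases may report a dedicated iterator class instead, but it is likewise single-pass). This part of the Biopython module is written in pure Python: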

>>> from Bio import SeqIO
>>> a = SeqIO.parse("a.fasta", "fasta")
>>> type(a)
<class 'generator'>

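Once a generator has been iterated over, it is exhausted, as you discovered: in your second snippet, records = list(a) already consumed every record, so the later for loop over a has nothing left to yield. You can't rewind a generator, but you can store its contents in a list or dict if you don't mind putting it all in memory (useful if you need random access); in your case, looping over records instead of a would work. You can use SeqIO.to_dict(a) on a fresh, unconsumed generator to build a dictionary with the record ids as the keys and the SeqRecord objects as the values. Alternatively, simply rebuilding the generator by calling SeqIO.parse() again avoids loading the whole file contents into memory.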
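
To make the exhaustion behaviour concrete, here is a minimal sketch that runs without Biopython installed. Note that parse_fasta() and FASTA_TEXT are hypothetical stand-ins I made up for illustration: parse_fasta() is a plain generator that only mimics the one-pass behaviour of SeqIO.parse() and yields simple (id, sequence) tuples rather than SeqRecord objects.

```python
from io import StringIO

# Hypothetical in-memory FASTA content standing in for "a.fasta".
FASTA_TEXT = ">seq1\nACGT\n>seq2\nGGCC\n"


def parse_fasta(handle):
    """Hypothetical stand-in for SeqIO.parse(): lazily yields (id, sequence) pairs."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header starts: emit the previous record, if any.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    # Emit the final record once the input ends.
    if seq_id is not None:
        yield seq_id, "".join(chunks)


a = parse_fasta(StringIO(FASTA_TEXT))
records = list(a)   # consumes every record from the generator
leftover = list(a)  # the generator is exhausted, so this is empty

# Option 1: reuse the stored list as often as needed.
ids_from_list = [seq_id for seq_id, _ in records]

# Option 2: build a dict once for random access by id.
by_id = dict(records)

# Option 3: rebuild the generator to iterate from the start again.
fresh = list(parse_fasta(StringIO(FASTA_TEXT)))
```

The same pattern should carry over to the follow-up case with the pandas loop: build the list or dict once before the loop and look records up by id inside it, rather than iterating over the already-consumed generator.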

Upvotes: 5
