user2168818
user2168818

Reputation: 11

Reading multiple blast files (biopython)

I'm trying to read a list of XML files generated through multiple sequence submissions to the NCBI blast website. From each file, I want to print certain lines of information. The files I want to read are all given the suffix "_recombination.xml".

for file in glob.glob("*_recombination.xml"):
    result_handle= open(file)
    blast_record=NCBIXML.read(result_handle)
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            print "*****Alignment****"
            print "sequence:", alignment.title
            print "length:", alignment.length
            print "e-value:", hsp.expect
            print hsp.query
            print hsp.match
            print hsp.sbjct

The script first finds all the files with the "_recombination.xml" suffix, I then want it to read each file, and print certain lines (this is pretty much a straight copy from the BioPython cook book), which it seems to do. But I get the following error:

Traceback (most recent call last):
File "Scripts/blast_test.py", line 202, in <module>
blast_record=NCBIXML.read(result_handle)
File "/Library/Python/2.7/site-packages/Bio/Blast/NCBIXML.py", line 576, in read
first = iterator.next()
File "/Library/Python/2.7/site-packages/Bio/Blast/NCBIXML.py", line 643, in parse
expat_parser.Parse("", True) # End of XML record
xml.parsers.expat.ExpatError: no element found: line 3106, column 7594 

I'm not really sure what the problem is. I'm not sure if it's trying to loop back through the files it's already read- for example, closing the files seems to help:

for file in glob.glob("*_recombination.xml"):
    result_handle= open(file)
    blast_record=NCBIXML.read(result_handle)
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            print "*****Alignment****"
            print "sequence:", alignment.title
            print "length:", alignment.length
            print "e-value:", hsp.expect
            print hsp.query
            print hsp.match
            print hsp.sbjct
    result_handle.close()
    blast_record.close()

But it also gives me another error:

Traceback (most recent call last): 
File "Scripts/blast_test.py", line 213, in <module> blast_record.close() 
AttributeError: 'Blast' object has no attribute 'close'

Upvotes: 1

Views: 1556

Answers (2)

Cat
Cat

Reputation: 1

I would add a comment to Biogeek answer but I cannot (not enough reputation yet). In deed he is right, you should use

NCBIXML.parse(open(input_xml))

instead of NCBIXML.read(open(input_xml)) because you are "trying to read a list of XML files" and for lists of XML files you need to parse and not read. Did it fix your problem?

Upvotes: 0

dpflieger
dpflieger

Reputation: 41

I usually use the parse method instead of read, maybe it could help you:

for blast_record in NCBIXML.parse(open(input_xml)):
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            print "*****Alignment****"
            print "sequence:", alignment.title
            print "length:", alignment.length
            print "e-value:", hsp.expect
            print hsp.query
            print hsp.match
            print hsp.sbjct

and be sure that your xmls are generated with -outfmt 5 in your query blast

Upvotes: 3

Related Questions