Parsing blast output in .xml format

Question

I have a blast output in .xml format, but will not post an example here, since it is huge, unless you really require it. I go specifically to my question. The script below works OK. The only thing is I want to print the hit_def, which has different length in the file followed by a space. How to modify a code to print me hit_def? As you can see if I specify [:8] if will print me 8 characters, but then the length might be 10, 15 etc, how to improve this?

import re
import sys


base = sys.argv[1]
base = base.rstrip('xml')


if fasta_out == True:
    seq_out = open(base+'fasta', 'w')
read_def = set()
with open(sys.argv[1],'rb') as xml:
    for line in xml:
        if re.search('', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = re.sub('', '', line)
            line = re.sub('', '', line)
            query_def = line
        if re.search('', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = re.sub('', '', line)
            line = re.sub('', '', line)
            hit_def = line[:8]
            if fasta_out == True:
                print >> seq_out, query_def+'	'+hit_def+'
'
if fasta_out == True:
    seq_out.close()

This is an example how my hit_def looks,

>MLOC_36586.11 pep:known chromosome:030312v2:1:9883453:9888834:-1 gene:MLOC_36586 transcript:MLOC_36586.11 description:"Uncharacterized protein "
>MLOC_36586.2 pep:known chromosome:030312v2:1:9883444:9888847:-1 gene:MLOC_36586 transcript:MLOC_36586.2 description:"Uncharacterized protein "
>MLOC_51.2 pep:known chromosome:030312v2:1:322147737:322148802:1 gene:MLOC_51 transcript:MLOC_51.2 description:"Predicted protein\x3b Uncharacterized protein "
>MLOC_217.1 pep:known chromosome:030312v2:4:519918111:519919326:1 gene:MLOC_217 transcript:MLOC_217.1 description:"Uncharacterized protein "

Desired hit_def's,

MLOC_36586.11
MLOC_36586.2
MLOC_51.2
MLOC_217.1

Mostly Harmless · Accepted Answer

If you know it's always the first item in the string, you can do something like this:

hit_def = line[:line.index(' ')]

If it isn't necessarily first, you might go for a regex like this:

hit_def = re.findall(r'(MLOC_\d+\.\d+) ',line)[0]

I'm assuming that your hit_defs are all of the form MLOC_XXX.X, but you get the idea.

Parsing blast output in .xml format

Answers (1)

Related Questions