Reputation: 1151
I have a blast output in .xml format, but will not post an example here, since it is huge, unless you really require it. I go specifically to my question. The script below works OK. The only thing is I want to print the hit_def, which has different length in the file followed by a space. How to modify a code to print me hit_def? As you can see if I specify [:8] if will print me 8 characters, but then the length might be 10, 15 etc, how to improve this?
import re
import sys
base = sys.argv[1]
base = base.rstrip('xml')
if fasta_out == True:
seq_out = open(base+'fasta', 'w')
read_def = set()
with open(sys.argv[1],'rb') as xml:
for line in xml:
if re.search('<Iteration_query-def>', line) != None:
line = line.strip()
line = line.rstrip()
line = re.sub('<Iteration_query-def>', '', line)
line = re.sub('</Iteration_query-def>', '', line)
query_def = line
if re.search('<Hit_def>', line) != None:
line = line.strip()
line = line.rstrip()
line = re.sub('<Hit_def>', '', line)
line = re.sub('</Hit_def>', '', line)
hit_def = line[:8]
if fasta_out == True:
print >> seq_out, query_def+'\t'+hit_def+'\n'
if fasta_out == True:
seq_out.close()
This is an example how my hit_def looks,
>MLOC_36586.11 pep:known chromosome:030312v2:1:9883453:9888834:-1 gene:MLOC_36586 transcript:MLOC_36586.11 description:"Uncharacterized protein "
>MLOC_36586.2 pep:known chromosome:030312v2:1:9883444:9888847:-1 gene:MLOC_36586 transcript:MLOC_36586.2 description:"Uncharacterized protein "
>MLOC_51.2 pep:known chromosome:030312v2:1:322147737:322148802:1 gene:MLOC_51 transcript:MLOC_51.2 description:"Predicted protein\x3b Uncharacterized protein "
>MLOC_217.1 pep:known chromosome:030312v2:4:519918111:519919326:1 gene:MLOC_217 transcript:MLOC_217.1 description:"Uncharacterized protein "
Desired hit_def's,
MLOC_36586.11
MLOC_36586.2
MLOC_51.2
MLOC_217.1
Upvotes: 0
Views: 366
Reputation: 917
If you know it's always the first item in the string, you can do something like this:
hit_def = line[:line.index(' ')]
If it isn't necessarily first, you might go for a regex like this:
hit_def = re.findall(r'(MLOC_\d+\.\d+) ',line)[0]
I'm assuming that your hit_defs
are all of the form MLOC_XXX.X
, but you get the idea.
Upvotes: 1