How to separate each blast result using python regex and store it in a list for further analysis

Question

I am working on a set of biological sequences using ncbi-blast, and I need some help processing the output file with Python regex. The text output contains many results (one sequence analysis per query) and looks something like this:

Query= lcl|TRINITY_DN2888_c0_g2_i1

Length=1394
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

sp|Q9S775|PKL_ARATH CHD3-type chromatin-remodeling factor PICKLE...  1640    0.0

sp|Q9S775|PKL_ARATH CHD3-type chromatin-remodeling factor PICKLE OS=Arabidopsis thaliana OX=3702 GN=PKL PE=1 SV=1
Length=1384

Score = 1640 bits (4248), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 830/1348 (62%), Positives = 1036/1348 (77%), Gaps = 53/1348 (4%)

Query  1   MSSLVERLRVRSERRPLYTDDDSDDDLYAARGGSESKQEERPPERIVRDDAKNDTCKTCG  60
           MSSLVERLR+RS+R+P+Y  DDSDDD +  +     +Q     E IVR DAK + C+ CG
Sbjct  1   MSSLVERLRIRSDRKPVYNLDDSDDDDFVPKKDRTFEQ----VEAIVRTDAKENACQACG  56

Lambda      K        H        a         alpha
   0.317    0.134    0.389    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267    0.0410   0.140    1.90      42.6     43.6

Effective search space used: 160862965056

Query= lcl|TRINITY_DN2855_c0_g1_i1

Length=145 ........................................ ................................................... ...................................................

I want to extract everything from "Query= lcl|TRINITY_DN2888_c0_g2_i1" up to the next query, "Query= lcl|TRINITY_DN2855_c0_g1_i1", and store each such block as one element of a Python list for further analysis (the entire file contains a few thousand query results). Is there a Python regex that can do this?

Here is my code:

#!/user/bin/python3
file=open("path/file_name","r+")
import re
inter=file.read()
lst=[]
lst=re.findall(r'>(.*)>',inter,re.DOTALL)
print(lst)
for x in lst:
    print(x)

I get the wrong output since the code prints the entire information present in file (thousands) rather than picking up one result at a time.

Thank you

Vince · Accepted Answer

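To get the result you want, replace the re.findall() call with re.split() and split on the newline that precedes each "Query= " line:

lst=re.split(r'\n(?=Query= )',inter)

The lookahead (?=Query= ) keeps the "Query= " line at the start of each chunk, and any text before the first query (such as the BLAST header) ends up in lst[0]. Note that the third positional argument of re.split() is maxsplit, so flags must be passed by keyword (flags=...); none are needed here.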

See this for more info on re.split():

https://docs.python.org/3/library/re.html
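As a minimal sketch of the split, assuming every result starts on a line beginning with "Query= " as in the sample from the question (the sample text is abbreviated and its header line is a placeholder; split_blast_records is a made-up helper name):

```python
import re


def split_blast_records(text):
    """Return one string per 'Query= ' block in plain-text BLAST output."""
    # Split at every newline that is immediately followed by "Query= ".
    # The lookahead is zero-width, so "Query= " stays at the start of each chunk.
    chunks = re.split(r'\n(?=Query= )', text)
    # Anything before the first query (the program header) is not a result.
    return [chunk.strip() for chunk in chunks if chunk.startswith('Query= ')]


# Abbreviated, made-up input in the same shape as the question's file.
sample = """placeholder header line

Query= lcl|TRINITY_DN2888_c0_g2_i1

Length=1394

Effective search space used: 160862965056

Query= lcl|TRINITY_DN2855_c0_g1_i1

Length=145
"""

records = split_blast_records(sample)
print(len(records))
for record in records:
    # The first line of each record is its "Query= ..." line.
    print(record.splitlines()[0])
```

Each element of records is then one complete result that can be analysed on its own.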

Also, you may want to consider the plain-text BLAST parser in Biopython, although it is now deprecated:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96

The plain text BLAST parser is located in Bio.Blast.NCBIStandalone.

As with the XML parser, we need to have a handle object that we can pass to the parser. The handle must implement the readline() method and do this properly. The common ways to get such a handle are to either use the provided blastall or blastpgp functions to run the local blast, or to run a local blast via the command line, and then do something like the following:

result_handle = open("my_file_of_blast_output.txt")

Well, now that we’ve got a handle (which we’ll call result_handle), we are ready to parse it. This can be done with the following code:

>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
>>> blast_record = blast_parser.parse(result_handle)

This will parse the BLAST report into a Blast Record class (either a Blast or a PSIBlast record, depending on what you are parsing) so that you can extract the information from it. In our case, let’s just print out a quick summary of all of the alignments greater than some threshold value.

>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments: 
...     for hsp in alignment.hsps: 
...         if hsp.expect < E_VALUE_THRESH: 
...             print('****Alignment****') 
...             print('sequence:', alignment.title) 
...             print('length:', alignment.length)
...             print('e value:', hsp.expect) 
...             print(hsp.query[0:75] + '...') 
...             print(hsp.match[0:75] + '...') 
...             print(hsp.sbjct[0:75] + '...')

If you also read the section 7.3 on parsing BLAST XML output, you’ll notice that the above code is identical to what is found in that section. Once you parse something into a record class you can deal with it independent of the format of the original BLAST info you were parsing. Pretty snazzy!
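The same "parse once, then work with records" idea can be applied to the chunks produced by a regex split, without Biopython. The sketch below is only an illustration: QueryRecord and to_record are hypothetical names, not part of Biopython, and it assumes that the first "Length=" line in a chunk is the query length and that a query without hits contains the text "No hits found".

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryRecord:
    """Hypothetical container for one query's result; not a Biopython class."""
    query_id: str
    query_length: Optional[int]
    has_hits: bool


def to_record(chunk):
    """Build a QueryRecord from one non-empty 'Query= ...' chunk of text."""
    # The query ID is whatever follows "Query= " on the first line.
    query_id = chunk.splitlines()[0][len("Query= "):].strip()
    # Assumption: the first line starting with "Length=" is the query length.
    match = re.search(r'^Length=(\d+)', chunk, flags=re.MULTILINE)
    query_length = int(match.group(1)) if match else None
    # Assumption: queries without hits carry a "No hits found" marker.
    has_hits = "No hits found" not in chunk
    return QueryRecord(query_id, query_length, has_hits)


# Abbreviated chunk in the shape of the question's sample.
chunk = (
    "Query= lcl|TRINITY_DN2888_c0_g2_i1\n\n"
    "Length=1394\n\n"
    "sp|Q9S775|PKL_ARATH CHD3-type chromatin-remodeling factor PICKLE\n"
    "Length=1384\n"
)
record = to_record(chunk)
print(record)
```

Once every chunk is a QueryRecord, later filtering (for example, keeping only queries with hits) no longer depends on the text layout.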
