Reputation: 1278
I have over 200,000 accessions in a flat file, which need to retrieve relevant entry from NBCI.
I use Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez) to do the job. But encountered several problems:
One possible solution in my head is to split the file into more sub-files and search individually. However this requires too much manual effort.
So I am just wondering if there is any other solution, or any code could be used.
Thanks in advance
Upvotes: 0
Views: 2090
Reputation: 1749
Your problem sounds a good fit for a Bio-star toolkit. This is a solution using BioSmalltalk
| giList gbReader |
giList := (BioObject openFullFileNamed: 'd:\Batch_entrez_1.txt') contents lines.
gbReader := BioNCBIGenBankReader new.
gbReader
genBankRecordsFrom: 'nuccore'
format: #setModeXML
uids: giList.
(BioGBSeqCollection newFromXMLCollection: gbReader searchResults)
collect: [: e | BioParser
tokenizeNcbiXmlBlast: e contents
nodes: #('GBAuthor' 'GBSeq_definition') ]
To execute/debug the script, just select it and a right-click will open the Smalltalk world-menu.
The API automatically split and fetch your accession list (in the script contained in Batch_entrez_1.txt) maintaining the NCBI Entrez post limits to avoid penalities.
The result format is XML (which is an "easy" format to parse or filter specific fields) although it could be any of the retrieval modes supported by Entrez, for example setting #setModeText will answer an ASN.1 representation. Replace 'nuccore' for the database you want to query. Finally choose the interesting fields, in the script I have choosed 'GBAuthor' and 'GBSeq_definition', but you are free to choose anyone of the available nodes.
Upvotes: 1