Mavershang

Reputation: 1278

How to fetch all records using NCBI Batch Entrez

I have over 200,000 accessions in a flat file and need to retrieve the corresponding entries from NCBI.

I use Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez) to do the job, but I have run into a problem:

  1. I split the initial file into multiple sub-files of 4,000 lines each, but Batch Entrez seems to have a size limit on the returned file. For example, if the first 1,000 accessions each return tens of thousands of lines and together reach the size limit, the remaining 3,000 accessions are rejected and never searched.

One possible solution I can think of is to split the file into even more sub-files and search each one individually. However, this requires too much manual effort.

So I am wondering whether there is any other solution, or any code that could be used.

Thanks in advance

Upvotes: 0

Views: 2090

Answers (1)

Hernán

Reputation: 1749

Your problem sounds like a good fit for a Bio-star toolkit. Here is a solution using BioSmalltalk:

| giList gbReader |
"Read the identifier list, one entry per line."
giList := (BioObject openFullFileNamed: 'd:\Batch_entrez_1.txt') contents lines.
gbReader := BioNCBIGenBankReader new.
"Request the records from the 'nuccore' database in XML mode."
gbReader
    genBankRecordsFrom: 'nuccore'
    format: #setModeXML
    uids: giList.
"Keep only the selected nodes from each returned record."
(BioGBSeqCollection newFromXMLCollection: gbReader searchResults)
    collect: [: e | BioParser 
                       tokenizeNcbiXmlBlast: e contents 
                       nodes: #('GBAuthor' 'GBSeq_definition') ]

To execute or debug the script, select it and right-click to open the Smalltalk world menu.

The API automatically splits your accession list (read in the script from Batch_entrez_1.txt) and fetches it in pieces, respecting the NCBI Entrez posting limits to avoid penalties.
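For readers who do not use Smalltalk, the same split-and-fetch idea can be sketched in Python with only the standard library. This is a rough sketch and not what BioSmalltalk does internally: it posts each batch to the NCBI E-utilities `efetch` endpoint, and the function names, the batch size of 200, the half-second pause, and the output file naming are all assumptions chosen for illustration.

```python
import time
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(path):
    """Return the non-empty, stripped lines of the accession file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_request(ids, db="nuccore"):
    """Build a POST request asking for one batch of GenBank records as XML."""
    params = {"db": db, "id": ",".join(ids), "rettype": "gb", "retmode": "xml"}
    data = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(EFETCH_URL, data=data)  # data present, so POST


def fetch_all(path, out_prefix, batch_size=200, pause=0.5):
    """Fetch every batch and write each response to its own numbered file.

    batch_size and pause are assumed values, not official NCBI figures.
    """
    ids = read_accessions(path)
    for number, chunk in enumerate(batches(ids, batch_size), start=1):
        with urllib.request.urlopen(build_request(chunk)) as response:
            payload = response.read()
        with open(f"{out_prefix}_{number:05d}.xml", "wb") as out:
            out.write(payload)
        time.sleep(pause)  # wait between requests to stay under the rate limit
```

Calling `fetch_all("Batch_entrez_1.txt", "records")` would then write `records_00001.xml`, `records_00002.xml`, and so on. Writing one file per batch means a failed request only costs that batch, which is the main reason for not concatenating everything into a single output.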

The result format is XML (an "easy" format to parse or to filter for specific fields), although it could be any of the retrieval modes supported by Entrez; for example, setting #setModeText will answer an ASN.1 representation. Replace 'nuccore' with the database you want to query. Finally, choose the fields you are interested in. In the script I have chosen 'GBAuthor' and 'GBSeq_definition', but you are free to choose any of the available nodes.
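To illustrate how simple the field selection is once the records are in XML, here is a small Python sketch that pulls the same two nodes out of a GenBank XML document. The `SAMPLE` record is made up for illustration (it is not real NCBI output), and `extract_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the GBSet/GBSeq layout; the values are invented.
SAMPLE = """<GBSet>
  <GBSeq>
    <GBSeq_definition>Example sequence, made up for illustration</GBSeq_definition>
    <GBSeq_references>
      <GBReference>
        <GBReference_authors>
          <GBAuthor>Doe,J.</GBAuthor>
          <GBAuthor>Roe,R.</GBAuthor>
        </GBReference_authors>
      </GBReference>
    </GBSeq_references>
  </GBSeq>
</GBSet>"""


def extract_fields(xml_text, tags=("GBAuthor", "GBSeq_definition")):
    """Return one dict per GBSeq record, mapping each tag to its text values."""
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.iter("GBSeq"):
        # iter() searches at any depth, so nested nodes such as GBAuthor are found
        records.append({tag: [node.text for node in seq.iter(tag)] for tag in tags})
    return records


if __name__ == "__main__":
    for record in extract_fields(SAMPLE):
        print(record)
```

Passing a different `tags` tuple selects other nodes in the same way.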

Upvotes: 1
