Austin Richardson

Reputation: 8437

BioPython: Skipping over bad GIDs with Entrez.esummary/Entrez.read

Sorry about the odd title.

I am using eSearch & eSummary to go from

Accession Number --> gID --> TaxID

Assume that 'accessions' is a list of 20 accession numbers (I do 20 at a time because that's the maximum that NCBI will allow).

I do:

handle = Entrez.esearch(db="nucleotide", rettype="xml", term=accessions)
record = Entrez.read(handle)
gids = ",".join(record[u'IdList'])

This gives me the 20 GIDs corresponding to those 20 accession numbers.

Followed by:

handle = Entrez.esummary(db="nucleotide", id=gids)
record = Entrez.read(handle)

Which gives me this error because one of the GIDs in gids has been removed from NCBI:

File ".../biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 191, in endElement
    value = IntegerElement(value)
ValueError: invalid literal for int() with base 10: ''

I could wrap the call in try/except, but that would also skip the other 19 GIDs, which are fine.

My question is:

How do I read 20 records at a time with Entrez.read and skip over the ones that are missing without sacrificing the other 19? I could do one at a time, but that would be incredibly slow (I have 300,000 accession numbers, and NCBI only allows 3 queries per second; in practice it's more like 1 query per second).

Upvotes: 3

Views: 467

Answers (2)

Austin Richardson

Reputation: 8437

I sent a message out to the BioPython mailing list. Apparently it's a bug and they're working on it.
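Until a fix is released, one possible workaround is to bisect a failing batch: when a batch raises the ValueError shown in the question, split it in half and retry each half, so that only the bad GID ends up being skipped. This is a minimal sketch and not something from the mailing list. The names read_skipping_bad and fetch_summaries are made up for illustration, and it assumes a removed GID always surfaces as a ValueError from Entrez.read.

```python
def read_skipping_bad(ids, fetch):
    """Return (results, bad_ids) for a batch of IDs.

    `fetch` is any callable that takes a list of IDs and returns a list of
    records, raising ValueError if the batch cannot be parsed. A failing
    batch is split in half and retried until the bad IDs are isolated.
    """
    if not ids:
        return [], []
    try:
        return list(fetch(ids)), []
    except ValueError:
        if len(ids) == 1:
            # A single ID that still fails is the culprit: skip it.
            return [], list(ids)
        mid = len(ids) // 2
        left, left_bad = read_skipping_bad(ids[:mid], fetch)
        right, right_bad = read_skipping_bad(ids[mid:], fetch)
        return left + right, left_bad + right_bad


def fetch_summaries(ids):
    """Hypothetical wrapper around the eSummary call from the question."""
    from Bio import Entrez  # imported here so the helper above works without Biopython
    handle = Entrez.esummary(db="nucleotide", id=",".join(ids))
    return Entrez.read(handle)
```

It would be called as records, bad = read_skipping_bad(record[u'IdList'], fetch_summaries). A batch with no bad GIDs still costs a single request. A batch of 20 with one bad GID costs a handful of extra requests rather than 20 separate ones, although each retry still counts against the NCBI rate limit.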

Upvotes: 3

John La Rooy

Reputation: 304403

I'd have a look at Parser.py and see what is being parsed. It looks like you are getting a result back from NCBI without trouble, but the format of one record is tripping up the parser.

It may be possible to subclass/monkeypatch the parser to get it past the exception.

Upvotes: 0
