Reputation: 401
I am trying to fetch FASTA sequences for accession numbers from NCBI using Biopython. Usually the sequences download successfully, but once in a while I get the error below:
http.client.IncompleteRead: IncompleteRead(61808640 bytes read)
I have searched the existing answers to "How to handle IncompleteRead: in python" and tried the top answer (https://stackoverflow.com/a/14442358/4037275). It works, but the problem is that it downloads partial sequences. Is there any other way? Can anyone point me in the right direction?
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "my email id"

def extract_fasta_sequence(NC_accession):
    """Take an NC_accession number and fetch its FASTA sequence from NCBI."""
    print("Extracting the fasta sequence for the NC_accession:", NC_accession)
    handle = Entrez.efetch(db="nucleotide", id=NC_accession, rettype="fasta", retmode="text")
    record = handle.read()
    handle.close()
    return record
Upvotes: 2
Views: 1512
Reputation: 195
I think the best way to solve this problem is to query the NCBI E-utilities base URLs directly with the requests package. That way you can easily set a timeout for the host's response.
e.g. the base URLs all start with https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ followed by the tool name (esearch.fcgi, efetch.fcgi, esummary.fcgi, ...).
You can find complete information in the E-utilities guide on the NCBI website.
This is convenient because some of these errors occur when the NCBI host does not respond and the request waits a long time without any reply, while simply re-sending the request often succeeds. You can therefore combine a timeout with a try/except statement to build your own retrieval code (see the retry sketch after the example below).
As an example, suppose I have an EC number and want to use ESearch to find 50 related papers in the PubMed database from 2015 to now.
import requests
import re

ec_num = '1.1.1.6'
esearch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
payload = {'db': 'pubmed', 'term': f"{ec_num}[EC/RN Number]",
           'retmax': 50, 'sort': "pub_date", 'usehistory': "y",
           'datetype': 'pdat', 'mindate': '2015', 'maxdate': '3000'}
handle = requests.get(esearch_url, params=payload, timeout=20)  # set a 20 s timeout
records = handle.text
## Retrieve query_key and web_env for the next tool (e.g. ESummary, ELink, EFetch)
query_key = re.search(r'<QueryKey>(\d+)</QueryKey>', records).group(1)
web_env = re.search(r'<WebEnv>(\w+)</WebEnv>', records).group(1)
## Retrieve the number of related articles
counts = re.search(r"<Count>(\d+)</Count>", records).group(1)
#print(counts)
## Retrieve the PubMed IDs of related articles
pub_ids = re.findall(r"<Id>(\d+)</Id>", records)
#print(pub_ids)
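Below is a minimal sketch of the try/except retry idea applied to the original question's use case (fetching a FASTA record from the nucleotide database through efetch.fcgi). The retry count, timeout value, sleep interval, and the fetch_fasta helper name are illustrative choices, not part of the answer above.

import time
import requests

efetch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

def fetch_fasta(accession, retries=3, timeout=20):
    """Fetch a FASTA record from the NCBI nucleotide database, retrying on failure."""
    payload = {'db': 'nucleotide', 'id': accession,
               'rettype': 'fasta', 'retmode': 'text'}
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(efetch_url, params=payload, timeout=timeout)
            response.raise_for_status()   # raise on HTTP errors (e.g. 429, 500)
            return response.text          # full FASTA text on success
        except requests.exceptions.RequestException as err:
            print(f"Attempt {attempt} failed: {err}")
            time.sleep(5)                 # brief pause before re-requesting
    raise RuntimeError(f"Could not fetch {accession} after {retries} attempts")

# Example usage (hypothetical accession):
# seq_text = fetch_fasta("NC_000913.3")

The timeout stops the request from hanging indefinitely, and the loop simply re-gets the URL when a network error is raised.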
Upvotes: 0
Reputation: 1614
You will need to add a try/except to catch common network errors like this. Note that the exception http.client.IncompleteRead (httplib in Python 2) is a subclass of the more general HTTPException; see https://docs.python.org/3/library/http.client.html#http.client.IncompleteRead
For an example of this pattern, see e.g. http://lists.open-bio.org/pipermail/biopython/2011-October/013735.html
See also https://github.com/biopython/biopython/pull/590, which would catch some of the other errors you can get with the NCBI Entrez API (errors the NCBI ought to deal with but doesn't).
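A minimal sketch of that try/except approach wrapped around the question's efetch call; the retry count, sleep interval, and the fetch_with_retry name are my own assumptions, not taken from the linked posts.

import time
from http.client import HTTPException  # IncompleteRead is a subclass of this
from urllib.error import HTTPError
from Bio import Entrez

Entrez.email = "my email id"

def fetch_with_retry(NC_accession, retries=3):
    """Call Entrez.efetch and retry when the connection drops mid-download."""
    for attempt in range(1, retries + 1):
        try:
            handle = Entrez.efetch(db="nucleotide", id=NC_accession,
                                   rettype="fasta", retmode="text")
            record = handle.read()
            handle.close()
            return record
        except (HTTPException, HTTPError) as err:
            print(f"Attempt {attempt} failed: {err}")
            time.sleep(15)  # give the NCBI servers a moment before retrying
    raise RuntimeError(f"Failed to fetch {NC_accession} after {retries} attempts")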
Upvotes: 2