Reputation: 1527
I am running a script that scrapes several hundred pages on a site, but recently I have been running into IncompleteRead()
errors. My understanding from looking on Stack Overflow is that they can happen for any number of unknown reasons.
From searching around, I believe the error is raised at random by the request code below:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# unq is the collection of EC numbers to scrape, defined earlier in the script
for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" +
                  ec, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
3.5.2.3
2.1.3.15
2.5.1.72
1.5.1.2
6.1.1.9
3.2.2.27
Traceback (most recent call last):
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 554, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 521, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 571, in _readall_chunked
chunk_left = self._get_chunk_left()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 556, in _get_chunk_left
raise IncompleteRead(b'')
IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-20-82f1876d3006>", line 5, in <module>
html = urlopen(url).read()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 464, in read
return self._readall_chunked()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 578, in _readall_chunked
raise IncompleteRead(b''.join(value))
IncompleteRead: IncompleteRead(1772944 bytes read)
The error happens randomly, in that it is not always the same URL that causes it; https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27
caused this specific one.
Some solutions seem to introduce a try
clause, but within the except
they store the partial data (I think). Why is that the case, and why not just resubmit the request?
If resubmitting is the way to go, how would I rerun the request, since doing that manually normally seems to solve the issue? Beyond this I have no idea how to fix the problem.
Upvotes: 2
Views: 6032
Reputation: 61
I have faced the same issue and found this solution.
After some small changes, the code looks like this:
import json
from http.client import IncompleteRead, HTTPResponse
from urllib.request import urlopen
from urllib.error import URLError, HTTPError
...

def patch_http_response_read(func):
    # Wrap HTTPResponse.read so that an IncompleteRead returns the
    # partial data that was received instead of raising.
    def inner(*args):
        try:
            return func(*args)
        except IncompleteRead as e:
            return e.partial
    return inner

HTTPResponse.read = patch_http_response_read(HTTPResponse.read)

try:
    response = urlopen(my_url)
    result = json.loads(response.read().decode('UTF-8'))
except HTTPError as e:   # HTTPError is a subclass of URLError, so catch it first
    print('HTTP Error code: ', e.code)
except URLError as e:
    print('URL Error Reason: ', e.reason)
I'm not sure it is the best way, but it works in my case. I'll be happy if this advice is useful to you or helps you find a different, better solution. Happy coding!
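If you would rather not patch HTTPResponse.read globally, the same idea can be applied locally. Here is a minimal sketch under the assumptions of the question (the read_allowing_partial helper name is mine; the URL and header come from the question's code): catch IncompleteRead around the single read and keep whatever partial data arrived.

from http.client import IncompleteRead
from urllib.request import Request, urlopen

def read_allowing_partial(url):
    # Hypothetical helper: try to read the full body, but fall back
    # to the bytes that did arrive if the connection is cut off.
    try:
        return urlopen(url).read()
    except IncompleteRead as e:
        return e.partial

url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
              headers={'User-Agent': 'Mozilla/5.0'})
html = read_allowing_partial(url)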
Upvotes: 2
Reputation: 148900
The stack trace suggests that you are reading a chunked transfer-encoded response and that, for whatever reason, you lost the connection between two chunks.
As you have said, this can happen for numerous causes, and the occurrence is random. So the best you can do is catch the error and retry, after an optional delay.
For example:
import http.client
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" +
                  ec, headers={'User-Agent': 'Mozilla/5.0'})
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise   # give up after 4 attempts
            # optionally add a delay here
    soup = BeautifulSoup(html, 'html.parser')
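As for the optional delay, here is a minimal sketch of the same retry idea with an increasing pause between attempts. The fetch_with_retry helper name and the 2-second base delay are just illustrative choices, not something the site requires:

import time
import http.client
from urllib.request import urlopen

def fetch_with_retry(url, attempts=4, base_delay=2):
    # Retry the read a few times, sleeping a bit longer after each
    # IncompleteRead, and give up by re-raising on the last attempt.
    for i in range(attempts):
        try:
            return urlopen(url).read()
        except http.client.IncompleteRead:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (i + 1))

In the loop above, html = fetch_with_retry(url) would then replace the inner for i in range(4) block.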
Upvotes: 4