Reputation: 2607
This question differs from others that I have seen regarding requests.iter_content() in that requests seems to think it has successfully reached the end of the file I am iterating over. In reality the file has been truncated and is incomplete. The file I am trying to process is a 17 GB gzip that is to be enriched and stored in a database. A browser can download the file just fine.
Why is this file not downloading all the way, and why doesn't requests throw an exception if it can't download the whole file?
Source code (updated - see the edit below):
Here is my "reader" function - it is part of a multiprocessing script that is working with the data:
import zlib

import requests
import urllib3

def patch_urllib3():
    """Set urllib3's enforce_content_length to True by default."""
    previous_init = urllib3.HTTPResponse.__init__
    def new_init(self, *args, **kwargs):
        previous_init(self, *args, enforce_content_length=True, **kwargs)
    urllib3.HTTPResponse.__init__ = new_init

def reader(target_url, data_queue, coordinator_queue, chunk_size):
    patch_urllib3()
    #Using zlib.MAX_WBITS|32 apparently forces zlib to detect the appropriate header for the data
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)
    #Stream this file in as a request - pull the content in just a little at a time
    #This should remain open until completion.
    with requests.get(target_url, stream=True) as remote_file:
        last_line = ""  #start this blank
        #Chunk size can be adjusted to test performance
        for data_chunk in remote_file.iter_content(chunk_size=4096):
            #Decompress the current chunk
            decompressed_chunk = decompressor.decompress(data_chunk)
            #These characters are in "byte" format and need to be decoded to utf-8
            decompressed_chunk = decompressed_chunk.decode()
            #Append the "last line" to add any fragments from the last chunk - it is blank the first time around
            #This basically sticks line fragments from the last chunk onto the front of the current chunk.
            decompressed_chunk = last_line + decompressed_chunk
            #Run a split here; this is likely a costly step...
            data_chunk = list(decompressed_chunk.splitlines())
            #Pop the last line off the chunk since it isn't likely to be complete
            #We'll add it to the front of the next chunk
            last_line = data_chunk.pop()
            data_queue.put(data_chunk)
            coordinator_queue.put('CHUNK_READ')
    #File is fully read so send the last line and let the reader exit:
    print("Sending last line.")
    data_queue.put(last_line)
    #Notify coordinator process of task completion
    coordinator_queue.put('READ_DONE')
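For context, the reader is launched from a coordinating process roughly like this (the URL here is a placeholder and the queue wiring is simplified to a single worker):

import multiprocessing

if __name__ == "__main__":
    data_queue = multiprocessing.Queue()
    coordinator_queue = multiprocessing.Queue()
    #Placeholder URL - the real target is the 17 GB gzip file
    reader_process = multiprocessing.Process(
        target=reader,
        args=("https://example.com/huge_file.gz", data_queue, coordinator_queue, 4096))
    reader_process.start()
    #...other worker processes consume data_queue and coordinator_queue here...
    reader_process.join()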
Additional Notes:
- The problem shows up as a partial last_line at the end (which then breaks my data processing function).
- I thought that the with clause and the stream=True argument would help prevent the session from closing prematurely.
- No exception is thrown by the requests library when the request "completes".
- The "Sending last line." message prints out (per my code sample), which indicates that the with clause has "successfully" completed.
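One way to confirm the truncation is to compare the Content-Length header against the number of bytes the stream actually delivers; here is a minimal sketch of that check (assuming the server sends a Content-Length header for this file):

import requests

def check_download_size(url, chunk_size=4096):
    """Stream the URL and compare bytes received with the advertised Content-Length."""
    with requests.get(url, stream=True) as response:
        expected = int(response.headers.get("Content-Length", 0))
        received = 0
        for chunk in response.iter_content(chunk_size=chunk_size):
            received += len(chunk)
    #If the transfer used Content-Encoding: gzip the counts will not line up,
    #but for a plain .gz file download they should match.
    print("received", received, "of", expected, "bytes")
    return received, expected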
Update:
I have found a blog post that speaks directly to this issue. It appears that by default the requests package does not set urllib3's enforce_content_length option to True. There is no way to do this directly through requests, so one must "patch" the urllib3 options before setting up the requests object. Note the patch_urllib3() function listed in this github issue. I have updated my source code to include this function, and now I receive the following errors when the read stops prematurely:
urllib3.exceptions.IncompleteRead: IncompleteRead(52079993 bytes read, 18453799085 more expected)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(52079993 bytes read, 18453799085 more expected)', IncompleteRead(52079993 bytes read, 18453799085 more expected))
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(52079993 bytes read, 18453799085 more expected)', IncompleteRead(52079993 bytes read, 18453799085 more expected))
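At least with the patch applied the truncation now surfaces as an exception instead of a silent early exit, so the reading loop can catch it and flag the failure. A rough sketch of how the loop above could be wrapped (READ_FAILED is a made-up status message, not something the rest of my script handles yet):

import requests

def read_all_chunks(remote_file, handle_chunk, coordinator_queue, chunk_size=4096):
    """Iterate a streamed response and report truncation instead of finishing silently."""
    try:
        for data_chunk in remote_file.iter_content(chunk_size=chunk_size):
            handle_chunk(data_chunk)  #the decompress/split/queue logic from the reader above
    except requests.exceptions.ChunkedEncodingError:
        #Raised (with enforce_content_length patched) when the body ends before
        #Content-Length bytes have arrived.
        coordinator_queue.put('READ_FAILED')
        raise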
I am still working to see if there is a solution to these errors, or some way to resume the file download. I've tried sending another request with Range headers that start where the download ended previously, but it seems that the interruption to the initial request is irrecoverable.
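If the server honours Range requests, a retry loop that issues a completely fresh request each time the connection breaks might look like the sketch below. Note that it spools the compressed bytes to disk instead of streaming them straight into the decompressor, it assumes a 206 Partial Content response, and it relies on the enforce_content_length patch so that truncation raises instead of ending the loop quietly:

import os
import requests

def download_with_resume(url, path, chunk_size=1024 * 1024, max_retries=10):
    """Download url to path, resuming with a Range header after each broken connection."""
    for _ in range(max_retries):
        received = os.path.getsize(path) if os.path.exists(path) else 0
        headers = {"Range": f"bytes={received}-"} if received else {}
        try:
            with requests.get(url, headers=headers, stream=True) as response:
                if received and response.status_code != 206:
                    raise RuntimeError("Server ignored the Range header")
                response.raise_for_status()
                with open(path, "ab") as output_file:
                    for chunk in response.iter_content(chunk_size=chunk_size):
                        output_file.write(chunk)
            return path  #finished without the connection breaking
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError):
            continue  #retry, picking up at the current file size
    raise RuntimeError(f"download still incomplete after {max_retries} attempts")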
Upvotes: 9
Views: 2057