Reputation: 2607
This question differs from others that I have seen regarding requests.iter_content() in that requests seems to think it has successfully reached the end of the file I am iterating over. In reality the file has been truncated and is incomplete. The file I am trying to process is a 17 GB gzip that is to be enriched and stored in a database. A browser can download the file just fine.
Why is this file not downloading all the way, and why doesn't requests throw an exception if it can't download the whole file?
Source code (updated - see the edit below):
Here is my "reader" function - it is part of a multiprocessing script that is working with the data:
import zlib

import requests
import urllib3

def patch_urllib3():
    """Set urllib3's enforce_content_length to True by default."""
    previous_init = urllib3.HTTPResponse.__init__
    def new_init(self, *args, **kwargs):
        previous_init(self, *args, enforce_content_length=True, **kwargs)
    urllib3.HTTPResponse.__init__ = new_init

def reader(target_url, data_queue, coordinator_queue, chunk_size):
    patch_urllib3()
    #Using zlib.MAX_WBITS|32 apparently forces zlib to detect the appropriate header for the data
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)
    #Stream this file in as a request - pull the content in just a little at a time
    #This should remain open until completion.
    with requests.get(target_url, stream=True) as remote_file:
        last_line = ""  #start this blank
        #Chunk size can be adjusted to test performance
        for data_chunk in remote_file.iter_content(chunk_size=4096):
            #Decompress the current chunk
            decompressed_chunk = decompressor.decompress(data_chunk)
            #These characters are in "byte" format and need to be decoded to utf-8
            decompressed_chunk = decompressed_chunk.decode()
            #Append the "last line" to add any fragments from the last chunk - it is blank the first time around
            #This basically sticks line fragments from the last chunk onto the front of the current chunk.
            decompressed_chunk = last_line + decompressed_chunk
            #Run a split here; this is likely a costly step...
            data_chunk = list(decompressed_chunk.splitlines())
            #Pop the last line off the chunk since it isn't likely to be complete
            #We'll add it to the front of the next chunk
            last_line = data_chunk.pop()
            data_queue.put(data_chunk)
            coordinator_queue.put('CHUNK_READ')
    #File is fully read so send the last line and let the reader exit:
    print("Sending last line.")
    data_queue.put(last_line)
    #Notify coordinator process of task completion
    coordinator_queue.put('READ_DONE')
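For context, the reader is launched from a coordinating process roughly like this (the URL here is a placeholder and the queue wiring is simplified to a single worker):

import multiprocessing

if __name__ == "__main__":
    data_queue = multiprocessing.Queue()
    coordinator_queue = multiprocessing.Queue()
    #Placeholder URL - the real target is the 17 GB gzip file
    reader_process = multiprocessing.Process(
        target=reader,
        args=("https://example.com/huge_file.gz", data_queue, coordinator_queue, 4096))
    reader_process.start()
    #...other worker processes consume data_queue and coordinator_queue here...
    reader_process.join()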
Additional Notes:
- The problem shows up as a partial last_line at the end (which then breaks my data processing function).
- I thought that the with clause and the stream=True argument would help prevent the session from closing prematurely.
- No exception is thrown by the requests library when the request "completes".
- The "Sending last line." message prints out (per my code sample), which indicates that the with clause has "successfully" completed.
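One way to confirm the truncation is to compare the Content-Length header against the number of bytes the stream actually delivers; here is a minimal sketch of that check (assuming the server sends a Content-Length header for this file):

import requests

def check_download_size(url, chunk_size=4096):
    """Stream the URL and compare bytes received with the advertised Content-Length."""
    with requests.get(url, stream=True) as response:
        expected = int(response.headers.get("Content-Length", 0))
        received = 0
        for chunk in response.iter_content(chunk_size=chunk_size):
            received += len(chunk)
    #If the transfer used Content-Encoding: gzip the counts will not line up,
    #but for a plain .gz file download they should match.
    print("received", received, "of", expected, "bytes")
    return received, expected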
Update:
I have found a blog post that speaks directly to this issue. It appears that by default the requests package does not set urllib3's enforce_content_length option to True. There is no way to do this directly through requests, so one must "patch" the urllib3 options before setting up the requests object. Note the patch_urllib3() function listed in this github issue. I have updated my source code to include this function, and now I receive the following errors when the read stops prematurely:
urllib3.exceptions.IncompleteRead: IncompleteRead(52079993 bytes read, 18453799085 more expected)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(52079993 bytes read, 18453799085 more expected)', IncompleteRead(52079993 bytes read, 18453799085 more expected))
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(52079993 bytes read, 18453799085 more expected)', IncompleteRead(52079993 bytes read, 18453799085 more expected))
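At least with the patch applied the truncation now surfaces as an exception instead of a silent early exit, so the reading loop can catch it and flag the failure. A rough sketch of how the loop above could be wrapped (READ_FAILED is a made-up status message, not something the rest of my script handles yet):

import requests

def read_all_chunks(remote_file, handle_chunk, coordinator_queue, chunk_size=4096):
    """Iterate a streamed response and report truncation instead of finishing silently."""
    try:
        for data_chunk in remote_file.iter_content(chunk_size=chunk_size):
            handle_chunk(data_chunk)  #the decompress/split/queue logic from the reader above
    except requests.exceptions.ChunkedEncodingError:
        #Raised (with enforce_content_length patched) when the body ends before
        #Content-Length bytes have arrived.
        coordinator_queue.put('READ_FAILED')
        raise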
I am still working to see if there is a solution to these errors, or some way to resume the file download. I've tried sending another request with Range headers that start where the download ended previously, but it seems that the interruption to the initial request is irrecoverable.
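If the server honours Range requests, a retry loop that issues a completely fresh request each time the connection breaks might look like the sketch below. Note that it spools the compressed bytes to disk instead of streaming them straight into the decompressor, it assumes a 206 Partial Content response, and it relies on the enforce_content_length patch so that truncation raises instead of ending the loop quietly:

import os
import requests

def download_with_resume(url, path, chunk_size=1024 * 1024, max_retries=10):
    """Download url to path, resuming with a Range header after each broken connection."""
    for _ in range(max_retries):
        received = os.path.getsize(path) if os.path.exists(path) else 0
        headers = {"Range": f"bytes={received}-"} if received else {}
        try:
            with requests.get(url, headers=headers, stream=True) as response:
                if received and response.status_code != 206:
                    raise RuntimeError("Server ignored the Range header")
                response.raise_for_status()
                with open(path, "ab") as output_file:
                    for chunk in response.iter_content(chunk_size=chunk_size):
                        output_file.write(chunk)
            return path  #finished without the connection breaking
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError):
            continue  #retry, picking up at the current file size
    raise RuntimeError(f"download still incomplete after {max_retries} attempts")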
Upvotes: 9
Views: 2057