Carsten

Reputation: 2040

Content-length available in Curl, Wget, but not in Python Requests

I have a URL pointing to a binary file that I need to download after checking its size: the download should only be (re-)executed if the local file size differs from the remote file size.

This is how it works with wget (anonymized host names and IPs):

$ wget <URL>
--2020-02-17 11:09:18--  <URL>
Resolving <URL> (<host>)... <IP>
Connecting to <host> (<host>)|<ip>|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31581872 (30M) [application/x-gzip]
Saving to: ‘[...]’

This also works fine with the --continue flag in order to resume a download, including skipping if the file was completely downloaded earlier.

I can do the same with curl; the Content-Length is also present:

$ curl -I <url>
HTTP/2 200 
date: Mon, 17 Feb 2020 13:11:55 GMT
server: Apache/2.4.25 (Debian)
strict-transport-security: max-age=15768000
last-modified: Fri, 14 Feb 2020 15:42:29 GMT
etag: "[...]"
accept-ranges: bytes
content-length: 31581872
vary: Accept-Encoding
content-type: application/x-gzip

In Python, I try to implement the same logic by checking the Content-Length header using the requests library:

        with requests.get(url, stream=True) as response:
            total_size = int(response.headers.get("Content-length"))

            if not response.ok:
                logger.error(
                    f"Error {response.status_code} when downloading file from {url}"
                )
            elif os.path.exists(file) and os.stat(file).st_size == total_size:
                logger.info(f"File '{file}' already exists, skipping download.")
            else:
                [...] # download file

It turns out that the Content-Length header is never present for this URL, i.e. headers.get() returns None here. I know this could be worked around by passing a default value to the get() call, but for debugging purposes this example deliberately triggers an exception:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType' 
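For completeness, the workaround mentioned above could be sketched like this (a hypothetical helper; `headers` stands for any dict-like mapping such as `response.headers`):

```python
def remote_size(headers):
    """Return the Content-Length as an int, or None if the header is absent.

    Converting only when the header is present avoids the
    TypeError raised by int(None).
    """
    value = headers.get("Content-Length")
    return int(value) if value is not None else None
```

For example, `remote_size({"Content-Length": "31581872"})` returns `31581872`, while an empty mapping yields `None`.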

I can confirm manually that the Content-length header is not there:

requests.get(url, stream=True).headers
{'Date': '[...]', 'Server': '[...]', 'Strict-Transport-Security': '[...]', 'Upgrade': '[...]', 'Connection': 'Upgrade, Keep-Alive', 'Last-Modified': '[...]', 'ETag': '[...]', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=15, max=100', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/x-gzip'}

For other URLs, though, this logic works fine, i.e. I do get the Content-Length header.

When using requests.head(url) (omitting the stream=True), I get the same headers except for Transfer-Encoding.

I understand that a server does not have to send a Content-Length header. However, wget and curl clearly do get that header. What are they doing differently from my Python implementation?
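One difference is visible in the header dumps above: the requests response contains Content-Encoding: gzip and Transfer-Encoding: chunked, which the curl response does not, and requests advertises Accept-Encoding: gzip, deflate by default. Whether that is the cause here would need testing against this server; a sketch (URL and helper name are placeholders):

```python
import requests

def head_without_compression(url):
    """HEAD request that advertises only the identity encoding.

    requests sends 'Accept-Encoding: gzip, deflate' by default, so a
    server may compress on the fly and answer with Transfer-Encoding:
    chunked instead of a Content-Length. Asking for 'identity' can make
    it report the plain size again.
    """
    return requests.head(url, headers={"Accept-Encoding": "identity"})

if __name__ == "__main__":
    # placeholder URL, as in the question
    response = head_without_compression("https://example.com/archive.tar.gz")
    print(response.headers.get("Content-Length"))
```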

Upvotes: 1

Views: 957

Answers (1)

Carsten

Reputation: 2040

Not really an answer to the question about the missing Content-Length header, but a solution to the underlying problem:

Instead of comparing the local file size against the remote Content-Length, I ended up reading the Last-Modified header and comparing it to the mtime of the local file. This is also safer in the (unlikely) case that the remote file is updated but still has exactly the same size.
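That comparison could be sketched roughly like this (a hypothetical helper; the header value is parsed with the standard library's email.utils):

```python
import email.utils
import os

def is_up_to_date(local_path, last_modified_header):
    """Return True if the local file exists and its mtime is at least
    as recent as the remote Last-Modified timestamp."""
    if not os.path.exists(local_path):
        return False
    # Last-Modified uses the HTTP date format, e.g.
    # "Fri, 14 Feb 2020 15:42:29 GMT"
    remote_dt = email.utils.parsedate_to_datetime(last_modified_header)
    return os.path.getmtime(local_path) >= remote_dt.timestamp()
```

After a successful download, setting the local mtime to the remote timestamp with os.utime(path, (ts, ts)) keeps the two values comparable across runs.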

Upvotes: 1
