Reputation: 2040
I have a URL pointing to a binary file which I need to download after checking its size, because the download should only be (re-)executed if the local file size differs from the remote file size.
This is how it works with wget (host names and IPs anonymized):
$ wget <URL>
--2020-02-17 11:09:18-- <URL>
Resolving <host> (<host>)... <IP>
Connecting to <host> (<host>)|<IP>|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31581872 (30M) [application/x-gzip]
Saving to: ‘[...]’
This also works fine with the --continue flag in order to resume a download, including skipping if the file was already completely downloaded earlier.
I can do the same with curl; the content-length header is also present:
$ curl -I <url>
HTTP/2 200
date: Mon, 17 Feb 2020 13:11:55 GMT
server: Apache/2.4.25 (Debian)
strict-transport-security: max-age=15768000
last-modified: Fri, 14 Feb 2020 15:42:29 GMT
etag: "[...]"
accept-ranges: bytes
content-length: 31581872
vary: Accept-Encoding
content-type: application/x-gzip
In Python, I try to implement the same logic by checking the Content-length header using the requests library:
with requests.get(url, stream=True) as response:
    total_size = int(response.headers.get("Content-length"))
    if not response.ok:
        logger.error(
            f"Error {response.status_code} when downloading file from {url}"
        )
    elif os.path.exists(file) and os.stat(file).st_size == total_size:
        logger.info(f"File '{file}' already exists, skipping download.")
    else:
        [...]  # download file
It turns out that the Content-length header is never present, i.e. it gets a None value here. I know that this should be worked around by passing a default value to the get() call, but for debugging purposes, this example deliberately triggers an exception:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
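For reference, the workaround I mean would be something like the following (a sketch, with 0 as an arbitrary fallback value):

# Fall back to 0 when the header is absent instead of raising a TypeError:
total_size = int(response.headers.get("Content-length", 0))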
I can confirm manually that the Content-length header is not there:
requests.get(url, stream=True).headers
{'Date': '[...]', 'Server': '[...]', 'Strict-Transport-Security': '[...]', 'Upgrade': '[...]', 'Connection': 'Upgrade, Keep-Alive', 'Last-Modified': '[...]', 'ETag': '[...]', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=15, max=100', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/x-gzip'}
This logic works fine for other URLs, though, i.e. I do get the Content-length header.
When using requests.head(url) (omitting the stream=True), I get the same headers except for Transfer-Encoding.
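Since the requests response carries Content-Encoding: gzip and Vary: Accept-Encoding, one debugging probe would be to vary that request header myself. A minimal sketch (requests sends "Accept-Encoding: gzip, deflate" by default; asking for the identity encoding may keep the server from compressing on the fly):

import requests

url = "<url>"  # the URL in question

# Default request: requests advertises "Accept-Encoding: gzip, deflate".
default_headers = requests.head(url).headers
# Explicitly ask for the uncompressed representation instead.
identity_headers = requests.head(url, headers={"Accept-Encoding": "identity"}).headers

print(default_headers.get("Content-length"))   # None in my case
print(identity_headers.get("Content-length"))  # possibly the actual size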
I understand that a server does not have to send a Content-length header.
However, wget and curl clearly do get that header. What do they do differently from my Python implementation?
Upvotes: 1
Views: 957
Reputation: 2040
Not really an answer to the question about the missing Content-length header, but a solution to the underlying problem:
Instead of checking the local file size against the content length of the remote file, I ended up checking the Last-modified header and comparing it to the mtime of the local file. This is also safer in the (unlikely) case that the remote file is updated but still has the exact same size.
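A minimal sketch of that check (needs_download and file are hypothetical names; it assumes the local mtime is pinned to the remote timestamp after each successful download, e.g. via os.utime, so that exact comparison is meaningful):

import os
from email.utils import parsedate_to_datetime

import requests

def needs_download(url, file):
    # Only fetch the headers; Last-modified is present here even when
    # Content-length is not.
    response = requests.head(url)
    response.raise_for_status()
    remote_mtime = parsedate_to_datetime(response.headers["Last-modified"])
    if not os.path.exists(file):
        return True
    # (Re-)download only if the local mtime differs from the remote timestamp.
    return os.stat(file).st_mtime != remote_mtime.timestamp()

After a successful download, os.utime(file, (remote_mtime.timestamp(), remote_mtime.timestamp())) sets the local mtime accordingly, so the comparison holds on the next run.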
Upvotes: 1