Ion

Reputation: 1043

Python requests downloads HTML if file is not found

I am downloading a list of remote files. My code looks like the following:

try:
    r = requests.get(url, stream=True, verify=False)
    total_length = int(r.headers['Content-Length'])

    if total_length:
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()

except (requests.RequestException, StandardError):
    pass

My problem is that requests downloads plain HTML for files that do not exist (for example the 404 page, or similar HTML error pages). Is there a way to avoid this? Is there a header I can check, such as Content-Type?

Solution:

I used the r.raise_for_status() function call as per the accepted answer and also added an extra check for Content-Type like:

if r.headers['Content-Type'].split('/')[0] == "text":
    pass  # skip the file or raise an exception here

(MIME types list here: http://www.freeformatter.com/mime-types-list.html)
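A slightly more defensive version of that Content-Type check could look like the sketch below. The helper name is_text_response is made up for illustration and is not part of requests. It assumes the header may be missing or may carry parameters such as "; charset=utf-8".

```python
def is_text_response(headers):
    """Return True if the Content-Type header names a text/* media type.

    `headers` is any mapping with a .get() method, e.g. r.headers from a
    requests response. (Hypothetical helper, not part of requests.)
    """
    # The header may be absent, so fall back to an empty string.
    content_type = headers.get('Content-Type', '')
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(';', 1)[0].strip().lower()
    # Compare only the top-level type ("text" in "text/html").
    return media_type.split('/', 1)[0] == 'text'


# Possible usage with a requests response `r`:
#     r.raise_for_status()
#     if is_text_response(r.headers):
#         raise ValueError('expected a file, got a text response')
```

Note that this also rejects legitimate text downloads (for example text/csv), so it only fits when the expected files are binary.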

Upvotes: 0

Views: 1232

Answers (2)

Martijn Pieters

Reputation: 1121814

Use r.raise_for_status() to raise an exception for responses with 4xx and 5xx status codes, or test the r.status_code explicitly.

r.raise_for_status() raises an HTTPError exception. HTTPError is a subclass of RequestException, which you already catch:

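import requests

try:
    r = requests.get(url, stream=True, verify=False)
    r.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    total_length = int(r.headers['Content-Length'])

    if total_length:
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
# StandardError exists only in Python 2; on Python 3 catch Exception instead
except (requests.RequestException, StandardError):
    pass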

The r.status_code check would let you narrow down what you consider a proper response code. Note that requests follows 3xx redirects automatically, and you won't see other 3xx responses (such as 304 Not Modified) because requests doesn't send conditional requests in this case, so an explicit test is rarely needed. If you do want one, it would look something like:

r = requests.get(url, stream=True, verify=False)
r.raise_for_status()  # raises if not a 2xx or 3xx response
total_length = int(r.headers['Content-Length'])

if 200 <= r.status_code < 300 and total_length:
    # etc.
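To make the "# etc." part concrete, here is a minimal sketch of the explicit status test wrapped around the write loop from the question. The function name save_if_ok is hypothetical. It assumes `r` is a response obtained with requests.get(url, stream=True), or any object exposing status_code and iter_content() in the same way.

```python
def save_if_ok(r, file_name, chunk_size=1024):
    """Write the body of response `r` to file_name only for a 2xx status.

    Returns True if the file was written, False otherwise.
    (Hypothetical helper, not part of requests.)
    """
    # Anything outside 200-299 (e.g. a 404 error page) is not saved.
    if not 200 <= r.status_code < 300:
        return False
    with open(file_name, 'wb') as f:
        for chunk in r.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
    return True
```

Because the file is opened only after the status test passes, a failed download leaves no file behind rather than one containing the HTML error page.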

Upvotes: 4

Maciej Gol

Reputation: 15854

if r.status_code == 404:
    handle404()
else:
    download()

Upvotes: 1
