rpb

Reputation: 3299

How to prevent downloading an empty pdf file while using get and requests in Python?

I am scraping a website, which is accessible from this link, using Beautiful Soup. The idea is to download every href that contains the string .pdf using requests.get.

The code below demonstrates the procedure and works as intended:

import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
with open(filename, 'wb') as f:
    f.write(requests.get(url_to_download_pdf).content)

However, there are cases where a URL such as the one above (i.e., the variable url_to_download_pdf) leads to a "Page not found" page. As a result, an unusable and unreadable PDF is downloaded.

Opening the file with a PDF reader on Windows gives a warning.

Is there a way to avoid accessing and downloading an invalid PDF file?

Upvotes: 0

Views: 1316

Answers (3)

rpb

Reputation: 3299

Thanks to the users for their suggestions.

As per @Nicolas, save the PDF only if the response returns status code 200:

if response.status_code == 200:

If the file is opened with open(filename, 'wb') before the status code is checked, an empty file is created regardless of the response.

To avoid this, enter the with open(filename, 'wb') as f: block only after the status check has passed.

The complete code is then:

import requests
filename = 'new_name.pdf'
url_to_download_pdf='https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
my_req = requests.get(url_to_download_pdf)
if my_req.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(my_req.content)
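
A status check alone may not be enough if a server returns 200 together with an HTML "Page not found" body. As a rough sketch (not something taken from the answers here), the content can also be checked for the `%PDF-` marker that PDF files normally start with. The helper name `save_if_pdf` and the stand-in responses below are hypothetical; the helper only assumes an object with `status_code` and `content` attributes, which is what `requests.get()` returns.

```python
import os
import tempfile
from types import SimpleNamespace  # used only to build stand-in responses

PDF_MAGIC = b"%PDF-"  # PDF files normally begin with these bytes


def save_if_pdf(response, filename):
    """Write response.content to filename only if it looks like a PDF.

    `response` is any object with `status_code` and `content` attributes.
    Returns True if the file was written, False otherwise.
    """
    if response.status_code != 200:
        return False
    if not response.content.startswith(PDF_MAGIC):
        return False  # e.g. an HTML error page served with status 200
    with open(filename, "wb") as f:
        f.write(response.content)
    return True


# Offline demo with made-up responses (hypothetical data, no network access).
fake_pdf = SimpleNamespace(status_code=200, content=b"%PDF-1.4\n...")
fake_html = SimpleNamespace(status_code=200, content=b"<html>Page not found</html>")

with tempfile.TemporaryDirectory() as tmp:
    saved_pdf = save_if_pdf(fake_pdf, os.path.join(tmp, "good.pdf"))
    saved_html = save_if_pdf(fake_html, os.path.join(tmp, "bad.pdf"))
    bad_file_created = os.path.exists(os.path.join(tmp, "bad.pdf"))

print(saved_pdf, saved_html, bad_file_created)
```

With a real request this would be called as save_if_pdf(requests.get(url_to_download_pdf), filename). The file is opened only after both checks pass, so nothing is left on disk for a rejected response.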

Upvotes: 1

Nicolas Acosta

Reputation: 807

You have to check that the file you are requesting actually exists. If it does, the status code of the response will be 200. Here is an example of how to do that:

import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
response = requests.get(url_to_download_pdf)
# Check the status before opening the file, so a failed request leaves no empty file behind
if response.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(response.content)
else:
    print("Error, the file doesn't exist")
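
As a side note, the URL in the question looks like a site root prepended to a link that was already absolute, which might be why it leads to a missing page. That is only a guess from the shape of the URL. If the links are built from scraped href values, a sketch like the following, using the standard library's urllib.parse.urljoin, avoids that kind of concatenation. The `base` value and the relative link are assumptions for illustration.

```python
from urllib.parse import urljoin

# Assumed page the links were scraped from (a guess based on the question's URL).
base = "https://bradscholars.brad.ac.uk/"
href = ("https://www.brad.ac.uk/library/additional-help/"
        "bradford-scholars-faqs/digital_preservation_policy.pdf")

naive = base + href            # plain concatenation gives the URL from the question
joined = urljoin(base, href)   # an absolute href is returned unchanged

# A hypothetical relative link is resolved against the base instead.
relative = urljoin(base, "docs/file.pdf")

print(naive)
print(joined)
print(relative)
```

Whether the joined URL actually serves the PDF still has to be confirmed with the status check above.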

Upvotes: 1

Romit

Reputation: 338

Instead of writing the response content straight to the file with f.write(requests.get(url_to_download_pdf).content), you can first check the status of the response and save to the file only if the request was valid.

import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
response = requests.get(url_to_download_pdf)
# Only a 404 is rejected here; other error codes (e.g. 403 or 500) would still be saved
if response.status_code != 404:
    with open(filename, 'wb') as f:
        f.write(response.content)
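
One caveat: a != 404 test rejects only "not found" responses, so other failures would still be written to disk. The small comparison below is just an illustration of that difference; the two function names are made up for this sketch and no request is sent.

```python
def accepted_by_not_404(status_code):
    """The check used above: anything except 404 is saved."""
    return status_code != 404


def accepted_by_equals_200(status_code):
    """A stricter check: only a plain 200 response is saved."""
    return status_code == 200


# Compare both checks on a few common HTTP status codes.
for code in (200, 403, 404, 500):
    print(code, accepted_by_not_404(code), accepted_by_equals_200(code))
```

For 403 and 500 the first check still returns True, so an error page could end up saved as new_name.pdf. Comparing against 200 is the safer choice when only a successful download should be kept.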

Upvotes: 1
