Reputation: 3299
I am scraping a website, accessible from this link, using Beautiful Soup. The idea is to download every href that contains the string .pdf
using requests.get.
The code below demonstrates the procedure and works as intended:
import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
with open(filename, 'wb') as f:
    f.write(requests.get(url_to_download_pdf).content)
However, there are instances where a URL such as the one above (i.e., the variable url_to_download_pdf
) directs to a "Page not found"
page. As a result, an unusable and unreadable PDF is downloaded.
Opening the file with a PDF reader in Windows gives the following warning.
Is there any way to avoid accessing and downloading an invalid PDF
file?
Upvotes: 0
Views: 1316
Reputation: 3299
Thanks for the suggestion from the user.
As per @Nicolas,
save as PDF only if the response returns 200:
if response.status_code == 200:
In the previous version, an empty file was created regardless of the response, because with open(filename, 'wb') as f:
ran before the status_code
check. To avoid this, with open(filename, 'wb') as f:
should be entered only if the status code is as intended.
The complete code then is as below:
import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
my_req = requests.get(url_to_download_pdf)
if my_req.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(my_req.content)
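A status-code check alone may not catch every bad download: some servers answer a missing file with a 200 response carrying an HTML error page. As an extra safeguard (not part of the original answer), you could also inspect the Content-Type header and the PDF magic bytes before saving; looks_like_pdf below is an illustrative helper name:

```python
import requests

def looks_like_pdf(response):
    """Heuristic: 200 status, a PDF content type, and the %PDF magic bytes."""
    content_type = response.headers.get('Content-Type', '')
    return (response.status_code == 200
            and 'application/pdf' in content_type
            and response.content.startswith(b'%PDF'))

# Hypothetical usage with the variables from the answer above:
# response = requests.get(url_to_download_pdf)
# if looks_like_pdf(response):
#     with open(filename, 'wb') as f:
#         f.write(response.content)
```

The magic-byte check (%PDF) is the most reliable of the three tests, since it looks at the file content itself rather than trusting the server's metadata.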
Upvotes: 1
Reputation: 807
You have to validate that the file you request actually exists. If the file exists, the response code of the request will be 200. Here is an example of how to do that:
import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
with open(filename, 'wb') as f:
    response = requests.get(url_to_download_pdf)
    if response.status_code == 200:
        f.write(response.content)
    else:
        print("Error, the file doesn't exist")
Upvotes: 1
Reputation: 338
Instead of directly writing the contents of the response with
f.write(requests.get(url_to_download_pdf).content)
you can first check the status of the request, and only save to the file if it is a valid request.
import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
response = requests.get(url_to_download_pdf)
if response.status_code != 404:
    with open(filename, 'wb') as f:
        f.write(response.content)
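An alternative to comparing status codes by hand is requests' built-in raise_for_status(), which raises requests.HTTPError for any 4xx or 5xx response. A minimal sketch (download_pdf is an illustrative name, not from the answers above):

```python
import requests

def download_pdf(url, filename):
    """Save url to filename, raising requests.HTTPError on 4xx/5xx responses."""
    response = requests.get(url)
    response.raise_for_status()  # raises HTTPError for e.g. a 404 "Page not found"
    with open(filename, 'wb') as f:
        f.write(response.content)
```

This also catches server errors (500 etc.) that a plain != 404 check would let through.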
Upvotes: 1