Reputation: 251
I am trying to download a pdf file from a website using urllib. This is what I have so far:

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    local_file = open('some_file.pdf', 'w')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == 'main':
    download_file('http://www.example.com/some_file.pdf')
When I run this code, all I get is an empty pdf file. What am I doing wrong?
Upvotes: 23
Views: 62688
Reputation: 3460
FYI: you can also use wget to download a PDF from a URL easily. The urllib API keeps changing across versions and often causes issues (at least for me).
import wget

wget.download('http://www.example.com/some_file.pdf')
Instead of passing the PDF link directly, you can also modify your code to take a webpage link and extract all PDFs from it. Here's a guide for that: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
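A minimal sketch of that idea using only the standard library (the class name, base URL, and sample HTML are my own placeholders, not from the guide): parse the page's anchor tags and collect every href ending in .pdf, resolved against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkParser(HTMLParser):
    """Collects the absolute URL of every .pdf link on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.lower().endswith('.pdf'):
                    # Resolve relative hrefs against the page URL
                    self.pdf_links.append(urljoin(self.base_url, value))

# In practice you would feed it urlopen(page_url).read().decode();
# here a small HTML snippet stands in for a real page:
parser = PdfLinkParser('http://www.example.com/')
parser.feed('<a href="a.pdf">A</a> <a href="/docs/b.pdf">B</a> <a href="c.html">C</a>')
print(parser.pdf_links)
# → ['http://www.example.com/a.pdf', 'http://www.example.com/docs/b.pdf']
```

Each collected link can then be passed to whichever download function you settle on.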
Upvotes: 1
Reputation: 389
Here is an example that works:

import urllib2

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    file = open("document.pdf", 'wb')
    file.write(response.read())
    file.close()
    print("Completed")

if __name__ == "__main__":
    main()
Upvotes: 25
Reputation: 657
Try urllib.request.urlretrieve (Python 3) and just do this:

from urllib.request import urlretrieve

def download_file(download_url):
    urlretrieve(download_url, 'path_to_save_plus_some_file.pdf')

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')
Upvotes: 7
Reputation: 383
I tried the code above; it works fine in some cases, but for some websites with a PDF embedded in them you might get an error like HTTPError: HTTP Error 403: Forbidden. Such websites have server-side security features that block known bots. urllib announces itself with a default User-Agent header along the lines of Python-urllib/3.3.0, which servers can recognize and reject. So I would suggest adding a custom header to the request via urllib's request module, as shown below.
from urllib.request import Request, urlopen

url = "https://realpython.com/python-tricks-sample-pdf"

# Spoof a browser User-Agent so the server does not reject the request
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with urlopen(req) as response, open("<location to dump pdf>/<name of file>.pdf", "wb") as code:
    code.write(response.read())
Upvotes: 4
Reputation: 383
I would suggest using the following lines of code:

import urllib.request
import shutil

url = "link to your website for pdf file to download"
output_file = "local_directory/name.pdf"

with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
Upvotes: 1
Reputation: 495
Change open('some_file.pdf', 'w') to open('some_file.pdf', 'wb'): PDF files are binary files, so you need the 'b'. This is true of pretty much any file that you can't open in a text editor.
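To make the difference concrete, here is a small sketch of my own (the byte string is just a stand-in for web_file.read()): in Python 3, writing bytes to a file opened in text mode raises a TypeError, while binary mode round-trips the bytes intact.

```python
data = b'%PDF-1.4 minimal payload'   # stand-in for the bytes read from the URL

try:
    with open('demo.pdf', 'w') as f:   # text mode: wrong for bytes
        f.write(data)
except TypeError:
    pass  # Python 3 refuses bytes in text mode

with open('demo.pdf', 'wb') as f:      # binary mode: correct
    f.write(data)

with open('demo.pdf', 'rb') as f:
    print(f.read() == data)            # → True: bytes written unchanged
```

(In Python 2, text mode would not raise, but on Windows it silently translates newline bytes and corrupts the PDF, which is why the 'b' matters there too.)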
Upvotes: 13