user3774185

Reputation: 251

Download pdf using urllib?

I am trying to download a PDF file from a website using urllib. This is what I have so far:

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    local_file = open('some_file.pdf', 'w')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == 'main':
    download_file('http://www.example.com/some_file.pdf')

When I run this code, all I get is an empty PDF file. What am I doing wrong?

Upvotes: 23

Views: 62688

Answers (6)

x89

Reputation: 3460

FYI: you can also use wget to download PDFs from a URL easily. urllib's API keeps changing between versions and often causes issues (at least for me).

import wget

wget.download(link)

Instead of passing the PDF link directly, you can also modify your code so that you pass a webpage link and extract all PDFs from it. Here's a guide for that: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
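
That idea can be sketched with only the standard library (the parser class name and the sample HTML below are made up for illustration, not taken from the guide):

```python
# Collect every <a href="..."> on a page that points at a .pdf file
from html.parser import HTMLParser

class PdfLinkParser(HTMLParser):
    """Accumulates hrefs ending in .pdf as the parser walks the HTML."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

parser = PdfLinkParser()
parser.feed('<a href="/notes/ch1.pdf">Ch 1</a><a href="/about.html">About</a>')
print(parser.pdf_links)  # ['/notes/ch1.pdf']
```

Each collected link can then be fed to wget.download (or any of the downloaders in the other answers).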

Upvotes: 1

jamiemcg

Reputation: 389

Here is an example that works:

import urllib2

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    file = open("document.pdf", 'wb')
    file.write(response.read())
    file.close()
    print("Completed")

if __name__ == "__main__":
    main()
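
For Python 3, where urllib2 no longer exists, the same example translates to urllib.request. A sketch (the local_path default argument is my own addition; note the 'wb' mode):

```python
# Python 3 port of the example above: urllib2.urlopen -> urllib.request.urlopen
from urllib.request import urlopen

def download_file(download_url, local_path="document.pdf"):
    response = urlopen(download_url)
    with open(local_path, "wb") as f:  # "wb": a PDF is binary data
        f.write(response.read())
    print("Completed")

# download_file("http://mensenhandel.nl/files/pdftest2.pdf")
```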

Upvotes: 25

romulomadu

Reputation: 657

Try urllib.request.urlretrieve (Python 3) and just do this:

from urllib.request import urlretrieve

def download_file(download_url):
    urlretrieve(download_url, 'path_to_save_plus_some_file.pdf')

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')
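
urlretrieve also takes an optional reporthook callback, which is handy for showing progress on large files (a sketch; the hook function below is my own, not part of the answer):

```python
from urllib.request import urlretrieve

def show_progress(block_num, block_size, total_size):
    # urlretrieve calls this after each block; total_size is -1 when unknown
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print(f"\r{done}/{total_size} bytes", end="")

# urlretrieve('http://www.example.com/some_file.pdf',
#             'path_to_save_plus_some_file.pdf',
#             reporthook=show_progress)
```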

Upvotes: 7

Piyush Rumao

Reputation: 383

I tried the above code; it works fine in some cases, but on some websites with an embedded PDF you may get an error like HTTPError: HTTP Error 403: Forbidden. Such websites have server-side security features that block known bots. urllib identifies itself with a default User-Agent header that looks like Python-urllib/3.x, which these servers reject, so I would suggest adding a custom header to the request, as shown below.

from urllib.request import Request, urlopen

url = "https://realpython.com/python-tricks-sample-pdf"

# Send a browser-like User-Agent so the server does not reject urllib's default one
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with open("<location to dump pdf>/<name of file>.pdf", "wb") as code:
    code.write(urlopen(req).read())

Upvotes: 4

Piyush Rumao

Reputation: 383

I would suggest using the following lines of code:

import urllib.request
import shutil
url = "link to your website for pdf file to download"
output_file = "local directory://name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
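
What makes this approach nice is that shutil.copyfileobj streams the response in fixed-size chunks instead of reading it all into memory at once. A network-free sketch of the same pattern, with io.BytesIO standing in for the HTTP response:

```python
import io
import shutil

# Pretend ~100 KB PDF "download"; io.BytesIO mimics the response's file interface
fake_response = io.BytesIO(b"%PDF" + b"x" * 100_000)

with open("out.pdf", "wb") as out_file:
    # Copies 16 KB at a time, so memory use stays flat however large the source is
    shutil.copyfileobj(fake_response, out_file, length=16 * 1024)
```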

Upvotes: 1

shockburner

Reputation: 495

Change open('some_file.pdf', 'w') to open('some_file.pdf', 'wb'). PDF files are binary files, so you need the 'b'. The same goes for pretty much any file that you can't open in a text editor.
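
On Python 3 the mistake surfaces immediately, because a file opened in text mode rejects bytes outright (on Python 2, where the question's code runs, text mode instead silently corrupts the file on Windows via newline translation). A quick demonstration:

```python
pdf_bytes = b"%PDF-1.4\n%fake minimal content"

try:
    with open("demo.pdf", "w") as f:   # text mode: wrong for a PDF
        f.write(pdf_bytes)
except TypeError as err:
    print("text mode rejected bytes:", err)

with open("demo.pdf", "wb") as f:      # binary mode: correct
    f.write(pdf_bytes)
```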

Upvotes: 13
