Python downloading PDF with urllib2 creates corrupt document

Question

Last week I defined a function to download pdfs from a journal website. I successfully downloaded several pdfs using:

import urllib2
def pdfDownload(url):
    response=urllib2.urlopen(url)
    expdf=response.read()
    egpdf=open('ex.pdf','wb')
    egpdf.write(expdf)
    egpdf.close()

I tried this function out with:

 pdfDownload('http://pss.sagepub.com/content/26/1/3.full.pdf')

At the time, this was how the URLs on the journal Psychological Science were formatted. The pdf downloaded just fine.

I then went to write some more code to actually generate the URL lists and name the files appropriately so I could download large numbers of appropriately named pdf documents at once.

When I came back to join my two scripts together (sorry for non-technical language; I'm no expert, have just taught myself the basics) the formatting of URLs on the relevant journal had changed. Following the previous URL takes you to a page with URL 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'. And now the pdfDownload function doesn't work anymore (either with the original URL or new URL). It creates a pdf which cannot be opened "because the file is not a supported file type or has been damaged".

I'm confused: as far as I can tell, all that has changed is the formatting of the URLs, but something else must have changed to cause this. Any help would be hugely appreciated.

nrlakin · Accepted Answer

The problem is that the new URL points to a webpage--not the original PDF. If you print the value of "expdf", you'll get a bunch of HTML--not the binary data you're expecting.
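A quick way to confirm this without opening the file is to look at the first bytes of what you downloaded. A real PDF carries the marker %PDF- at (or very near) the start of the file, whereas an HTML page starts with something like <!DOCTYPE html>. Here is a minimal sketch of that check; the helper name looks_like_pdf is made up for illustration, and the 1024-byte window is an assumption chosen to tolerate a little junk before the header:

```python
def looks_like_pdf(data):
    """Return True if the raw bytes appear to be a PDF document.

    PDF files carry the marker %PDF- at the start; checking the first
    1024 bytes tolerates a small amount of leading junk (an assumption,
    not a strict validity test).
    """
    return b"%PDF-" in data[:1024]
```

You could call looks_like_pdf(expdf) right before writing the file and print a warning if it returns False.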

I was able to get your original function working with a small tweak--I used the requests library to download the file instead of urllib2. requests appears to pull the file with the loader referenced in the html you're getting from your current implementation. Try this:

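import requests
def pdfDownload(url):
    # requests.get follows redirects by default
    response = requests.get(url)
    # .content is the raw response body as bytes (not decoded text)
    expdf = response.content
    # the with-block closes the file even if the write fails
    with open('ex.pdf', 'wb') as egpdf:
        egpdf.write(expdf)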

Note that requests is a third-party package and is not part of the standard library in either Python 2.7 or Python 3, so if you don't already have it you'll need to pip install requests.
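Since the site has already changed its URLs once, it may be worth making the function fail loudly instead of silently saving a broken file. The following is only a sketch under a few assumptions: the path argument, the 30-second timeout, and the get parameter (there so a stand-in can be passed for testing) are my own additions, not anything the journal site or requests requires. It uses response.raise_for_status() and response.content, both of which are part of requests.

```python
def pdf_download(url, path, get=None):
    """Download url to path, refusing to save anything that is not a PDF.

    get is a hypothetical hook: any callable with the same shape as
    requests.get. It defaults to requests.get itself.
    """
    if get is None:
        import requests  # third-party: pip install requests
        get = requests.get

    response = get(url, timeout=30)
    # Turn HTTP errors (404, 500, ...) into exceptions instead of saving them.
    response.raise_for_status()

    data = response.content  # raw bytes of the response body
    # A PDF carries %PDF- at (or very near) the start of the file.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("expected a PDF from %s but got something else" % url)

    # The with-block closes the file even if the write fails.
    with open(path, "wb") as pdf_file:
        pdf_file.write(data)
    return path
```

With this, pdf_download(url, '3.pdf') either writes a file that starts with the PDF marker or raises, so a batch run over a list of URLs won't quietly fill a folder with HTML pages named .pdf. Whether a particular journal URL actually returns a PDF still depends on the site.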
