Phil

Reputation: 7096

Python urllib2 not obtaining full response (PDF)

I am trying to download a PDF by hitting a URL. Say my URL looks like this: http://foo.bar/this/downloads/pdf

If I open the URL directly in a browser, the PDF downloads without a problem. However, if I fetch it with urllib2.urlopen, I get an incomplete file.

import urllib2

url = "http://foo.bar/this/downloads/pdf"
sock = urllib2.urlopen(url)
content = sock.read()
with open('/tmp/test.pdf', 'w') as f:
    f.write(content)

The last 3 lines of /tmp/test.pdf look like this (the variable content ends the same way):

0000778731 00000 n 
0000778751 00000 n 
000

But the actual file that I downloaded from the browser looks like this:

0000778731 00000 n 
0000778751 00000 n 
0000778772 00000 n 
...
%%EOF

Every single PDF, regardless of size, seems to cut off somewhere in this final combination of numbers.

I have tried the solutions from the following two questions, and neither works. I believe the problem is not how the data is read, but that urllib2 is not receiving the full response in the first place.

python,not getting full response

urllib2 not retrieving entire HTTP response

Another possible factor (though I'm unsure) is how the PDF is sent to the browser. To my knowledge, it is sent from PHP using X-Sendfile. I am just confused as to why the PDF is only partially downloaded.

Upvotes: 0

Views: 1040

Answers (1)

Claudiu

Reputation: 229361

You have to open the file for writing in binary mode (note the wb):

with open('/tmp/test.pdf', 'wb') as f:
    f.write(content)

EDIT: Oh, you also have to keep reading until .read() returns nothing:

import urllib2

url = "http://foo.bar/this/downloads/pdf"
sock = urllib2.urlopen(url)
with open('/tmp/test.pdf', 'wb') as f:  # 'wb': write the bytes untouched
    while True:
        content = sock.read()
        if not content:  # an empty read means the stream is exhausted
            break
        f.write(content)
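
The same loop can be wrapped in a small helper that reads in fixed-size chunks and counts the bytes written, so the total can be compared with the Content-Length header when the server sends one. This is only a sketch: save_stream, is_truncated and the 64 KiB chunk size are names and values made up for illustration, and the helper assumes nothing about the response beyond a read(size) method.

```python
def save_stream(response, path, chunk_size=64 * 1024):
    """Copy a file-like response to `path` in binary mode, chunk by chunk.

    Returns the number of bytes written. `response` is assumed to be any
    object with a read(size) method, such as what urllib2.urlopen returns.
    """
    written = 0
    with open(path, "wb") as f:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the stream is exhausted
                break
            f.write(chunk)
            written += len(chunk)
    return written


def is_truncated(written, content_length):
    """True when a Content-Length value is known and fewer bytes arrived."""
    if content_length is None:
        return False  # no header to compare against
    return written < int(content_length)
```

With urllib2 this might be called as save_stream(sock, '/tmp/test.pdf'), with the expected size taken from sock.info().getheader('Content-Length'). Not every server sends that header, so a missing value is treated as "unknown" rather than as an error.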

From the urllib documentation:

One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.

This caveat doesn't appear in the documentation for urllib2, but the same concept applies.
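
To check whether a saved file is complete, one rough test is to look for the %%EOF marker near the end, since that is what the browser's copy ends with. The helper below is a sketch under that assumption: ends_with_pdf_eof and the 1024-byte tail size are illustrative choices, and finding the marker does not prove the rest of the PDF is valid.

```python
def ends_with_pdf_eof(path, tail_size=1024):
    """Return True if the last `tail_size` bytes of the file contain %%EOF."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end to learn the size
        size = f.tell()
        f.seek(max(0, size - tail_size))  # back up, but not past the start
        return b"%%EOF" in f.read()
```

A truncated download like the one in the question, which stops partway through the cross-reference table, would make this return False.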

Upvotes: 2
