Reputation: 7096
I am trying to download a PDF by hitting a URL. Say my URL looks like this: http://foo.bar/this/downloads/pdf
If I hit the URL directly, the browser downloads the PDF with no problem. However, if I try to get the PDF using urllib2.urlopen, I get an incomplete file.
url = "http://foo.bar/this/downloads/pdf"
sock = urllib2.urlopen(url)
content = sock.read()
with open('/tmp/test.pdf', 'w') as f:
    f.write(content)
The last 3 lines of /tmp/test.pdf look like this (and the variable content looks the same):
0000778731 00000 n
0000778751 00000 n
000
But the actual file that I downloaded from the browser looks like this:
0000778731 00000 n
0000778751 00000 n
0000778772 00000 n
...
%%EOF
Every single PDF, regardless of size, seems to cut off somewhere in this final combination of numbers.
I have tried the solutions from the following two questions, and neither works. I believe the cause is not the way the data is read, but that urllib2 is not getting the full response in the first place.
python,not getting full response
urllib2 not retrieving entire HTTP response
Another thing that may be a factor (though I'm unsure) is the way the PDF is sent to the browser. To my knowledge, the PDF is sent using PHP x-sendfile. I am just confused as to why the PDF is partially downloaded.
Upvotes: 0
Views: 1040
Reputation: 229361
You have to open the file for writing in binary mode (note the 'wb'):
with open('/tmp/test.pdf', 'wb') as f:
    f.write(content)
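To illustrate why the mode matters, here is a minimal sketch. It assumes Python 3, not the Python 2 used in the question, because there the difference is easy to see: a text-mode file rejects bytes outright, while a binary-mode file writes them unchanged. The helper write_bytes and the sample payload are made up for this example.

```python
import os
import tempfile

def write_bytes(path, data, mode):
    """Try to write data to path using the given mode; return True on success."""
    try:
        with open(path, mode) as f:
            f.write(data)
        return True
    except TypeError:
        # Python 3 text-mode files accept str only, not bytes.
        return False

# Fake PDF-like payload (hypothetical data, not a real PDF).
payload = b"%PDF-1.4\n\x00\xff binary body\n%%EOF\n"
path = os.path.join(tempfile.mkdtemp(), "test.pdf")

text_ok = write_bytes(path, payload, "w")     # text mode: rejected
binary_ok = write_bytes(path, payload, "wb")  # binary mode: accepted

# Read the file back in binary mode to confirm the bytes are intact.
with open(path, "rb") as f:
    round_trip = f.read()
```

In Python 2, text mode does not raise, but on platforms that translate line endings it can silently alter the bytes, so 'wb' is the safe choice for a PDF either way.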
EDIT: Oh, you also have to keep reading until .read() returns nothing:
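import urllib2

url = "http://foo.bar/this/downloads/pdf"
sock = urllib2.urlopen(url)
with open('/tmp/test.pdf', 'wb') as f:
    # A single read() may return only part of the body, so loop
    # until it returns an empty string (end of stream).
    while True:
        content = sock.read()
        if not content:
            break
        f.write(content)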
From the urllib documentation:
One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.
This caveat doesn't appear in the documentation for urllib2, but the same concept applies.
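The read-until-empty loop can also be sketched for Python 3, where urllib2 no longer exists. This is only an illustration under assumptions: save_stream and its chunk_size parameter are hypothetical names, and an in-memory io.BytesIO stands in for the HTTP response so the example runs without a network.

```python
import io

def save_stream(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in chunks until read() returns b''; return bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            # An empty result signals the end of the stream.
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# Stand-in for the HTTP response body (hypothetical data, not a real PDF).
fake_response = io.BytesIO(b"%PDF-1.4\n" + b"x" * 200000 + b"\n%%EOF\n")
out = io.BytesIO()
copied = save_stream(fake_response, out)
```

With a real download, the same function should work with the object returned by urllib.request.urlopen(url) as src and a file opened with 'wb' as dst.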
Upvotes: 2