Reputation: 316
I am writing a Python script to automate downloading some PDF pages (from a public-domain work) hosted on a website. Unfortunately, the individual PDF pages are embedded in frames, and when I used the following:
import time, urllib
for n in range(21,63):
    time.sleep(2)
    pdfPath="http://babel.hathitrust.org/cgi/imgsrv/download/pdf?id=wu.89038803698;orient=0;size=100;seq=%s;attachment=0"%(str(n))
    pdfName="Housner_"+str(n)+".pdf"
    f = open(pdfName, 'w')
    f.write(urllib.urlopen(pdfPath).read())
    f.close()
    time.sleep(2)
the downloaded files were actually blank, and Adobe shows errors, e.g. invalid image, embedded fonts not found, etc.
Can anyone kindly suggest how to improve this script so that the downloaded PDFs are not erroneous/corrupt?
Thanks.
Upvotes: 1
Views: 1630
Reputation: 5231
You are writing binary data to a file opened in text mode ('w'). On Windows, text mode rewrites every \n byte as \r\n on write, which corrupts a PDF.
f = open(pdfName,'wb')
should do the trick.
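For anyone on Python 3, where urllib.urlopen became urllib.request.urlopen, here is a minimal sketch of the same loop with the file opened in binary mode. The helper names save_stream and download_pages are made up for illustration, and the URL pattern is simply copied from the question; I have not checked it against the site's current behaviour.

```python
import shutil
import time
import urllib.request

# URL pattern copied from the question; only the seq value changes per page.
URL_TEMPLATE = ("http://babel.hathitrust.org/cgi/imgsrv/download/pdf"
                "?id=wu.89038803698;orient=0;size=100;seq=%d;attachment=0")


def save_stream(source, pdf_name):
    """Copy a binary file-like object to pdf_name without altering any bytes."""
    # 'wb' opens the file in binary mode, so no newline translation happens.
    with open(pdf_name, "wb") as f:
        shutil.copyfileobj(source, f)


def download_pages(first, last, delay=2):
    """Download pages first..last inclusive (hypothetical helper)."""
    for n in range(first, last + 1):
        with urllib.request.urlopen(URL_TEMPLATE % n) as response:
            save_stream(response, "Housner_%d.pdf" % n)
        time.sleep(delay)  # pause between requests, as in the question


# Same page range as range(21, 63) in the question:
# download_pages(21, 62)
```

The with blocks close the file and the HTTP response even if a download fails partway, which the explicit f.close() in the question would not do.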
Upvotes: 3