Reputation: 4940
I am helping somebody pull a large number (tens of thousands) of PDF files from a website. We have the pattern for the file names, but not all of the files will exist. I am assuming it is rude to ask for a file that does not exist, particularly at this scale. I am using Python, and in my tests of urllib I have found that this snippet gets me the file if it exists:
s=urllib.urlretrieve('http://website/directory/filename.pdf','c:\\destination.pdf')
If the file does not exist, I get a file that has the name I assigned but contains the text of their 404 page. I can handle this after I am done (read the files and delete all of the 404 pages), but that does not seem very nice to their server, nor is it very Pythonic.
I tried looking at the various functions in urllib, including urlretrieve, and do not see anything that tells me whether the file exists.
Upvotes: 1
Views: 167
Reputation: 67073
You can check the HTTP status code of the response. It will be 200 for PDFs that exist and 404 for PDFs that do not. The requests library makes this a lot easier:
>>> import requests
>>> r = requests.get('http://cdn.sstatic.net/stackoverflow/img/sprites.png')
>>> r.status_code
200
>>> r = requests.get('http://cdn.sstatic.net/stackoverflow/img/sprites.xxx')
>>> r.status_code
404
Upvotes: 6