How can I tell programmatically if a filename I am asking for exists on a webserver?

Question

I am helping somebody pull a large number (tens of thousands) of PDF files from a website. We know the pattern for the file names, but not all of the files will exist. I assume it is rude to request files that do not exist, particularly at this scale. I am using Python, and in my tests with urllib I found that this snippet retrieves the file if it exists:

import urllib
s = urllib.urlretrieve('http://website/directory/filename.pdf', r'c:\destination.pdf')

If the file does not exist, I still get a file with the name I assigned, but it contains the text of their 404 page. I could handle this afterwards (read the files and delete all of the 404 pages), but that does not seem very nice to their server, nor is it very Pythonic.

I looked through the various functions in urllib, including urlretrieve, and do not see anything that tells me whether the file exists.

jterrace · Accepted Answer

You can check the status code of the response. It will normally be 200 for PDFs that exist and 404 for ones that do not. The requests library makes this a lot easier:

>>> import requests
>>> r = requests.get('http://cdn.sstatic.net/stackoverflow/img/sprites.png')
>>> r.status_code
200
>>> r = requests.get('http://cdn.sstatic.net/stackoverflow/img/sprites.xxx')
>>> r.status_code
404
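
For the bulk download in the question, the same status check can also be done with only the standard library. What follows is a minimal sketch for Python 3, not a definitive implementation: `download_if_exists` is a hypothetical helper name, the 10-second timeout is an assumed default, and it assumes the server answers missing files with a real 404 status rather than a 200 "not found" page.

```python
import shutil
import urllib.error
import urllib.request


def download_if_exists(url, dest_path, timeout=10):
    """Save url to dest_path only if the server returns the file.

    Returns True if the file was saved and False if the server answered 404.
    (Hypothetical helper for illustration; the name and timeout are assumptions.)
    """
    try:
        # urlopen raises HTTPError for 4xx/5xx responses, so the destination
        # file is never opened when the server reports a missing file.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            with open(dest_path, "wb") as out:
                shutil.copyfileobj(response, out)
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # Anything else (403, 500, ...) is not the same as "missing", so re-raise.
        raise
```

This makes one GET per candidate name and writes nothing to disk on a 404, so there are no error pages to clean up afterwards. If you would rather check before downloading, `requests.head(url).status_code` asks for the status without the body. Either way, a short `time.sleep` between requests is probably kinder to their server at this scale.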
