Reputation: 33
I have a bunch of website URLs. Some of them are no longer accessible, because the website may have been deleted or for some other reason. Could you help me find out which ones? I have tried the following code:
def url_ok(url):
    try:
        r = requests.head(url)
        return r.status_code
    except:
        print("Status: False")
I was expecting that if an error occurs while executing the script, it probably means that the website does not exist anymore. But to my surprise, some of the URLs flagged "False" due to the exception were actually accessible when I opened them manually in the browser. So I guess my code does not work. Could you help me find out whether the URLs are accessible or not using Python? I am using the URLs in the column "Websites" in this spreadsheet: URLs
Upvotes: 0
Views: 1610
Reputation: 34086
You can use the requests module and make a GET call to check that the response status is 200. Like this, maybe:
In [292]: response = requests.get('https://stackoverflow.com/questions/61059821/using-python-how-do-i-check-a-website-is-accessible-or-not')
To make the request look like it comes from a browser, you can do the following:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
In [296]: response.status_code
Out[296]: 200
Just loop over your list of URLs and check whether the response is 200 or not.
for url in urls_list:
    response = requests.get(url)
    if response.status_code == 200:
        print('{} is active'.format(url))
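If some sites refuse the connection or never answer, the loop above will stop with an exception. A minimal sketch of a more defensive version follows, assuming a 10 second timeout and a shortened placeholder User-Agent; the names fetch_status, check_urls and HEADERS are made up for illustration. It keeps "got a non-200 status" separate from "got no response at all", which is the distinction the question's bare except was hiding.

```python
# Placeholder header; substitute a full browser User-Agent string if a site needs one.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    # Imported here so check_urls can also be exercised with a stub fetcher.
    import requests

    try:
        return requests.get(url, headers=HEADERS, timeout=timeout).status_code
    except requests.exceptions.RequestException:
        # Covers connection errors, timeouts, invalid URLs, too many redirects, etc.
        return None


def check_urls(urls, fetch=fetch_status):
    """Map each URL to True (status 200), False (other status) or None (no response)."""
    results = {}
    for url in urls:
        status = fetch(url)
        results[url] = None if status is None else status == 200
    return results
```

Calling check_urls(urls_list) returns a dict, so the URLs that came back as None can be retried or inspected by hand instead of being written off as dead.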
Upvotes: 1
Reputation: 554
In addition to what's been said about requests, make sure your GET requests are called with allow_redirects=True.
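This matters most for the HEAD call used in the question: requests.get follows redirects by default, but requests.head does not, so it has to be asked explicitly. A rough sketch, where head_status is a hypothetical helper name and the 10 second timeout is an assumption:

```python
def head_status(url, session=None, timeout=10):
    """Send a HEAD request that follows redirects.

    Returns (status_code, final_url) of the last response in the redirect chain.
    """
    if session is None:
        # Imported lazily so a stand-in session can be passed for experiments.
        import requests

        session = requests.Session()
    # head() defaults to allow_redirects=False, so it is set explicitly here.
    response = session.head(url, allow_redirects=True, timeout=timeout)
    return response.status_code, response.url
```

Without allow_redirects=True, a site that simply moved from http to https would report 301 or 302 rather than the status of the page it redirects to.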
Upvotes: 0
Reputation: 36630
HTTP response status codes are divided into five classes, so I suggest simply flagging all URLs that produce status_code < 400 as OK.
Edit: the requests Response object has an ok attribute that works exactly this way, per its help:
| ok
| Returns True if :attr:`status_code` is less than 400, False if not.
|
| This attribute checks if the status code of the response is between
| 400 and 600 to see if there was a client error or a server error. If
| the status code is between 200 and 400, this will return True. This
| is **not** a check to see if the response code is ``200 OK``.
So you might just do:
import requests
r = requests.head('http://www.example.com')
print(r.ok)
Output:
True
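For reference, the five classes are determined by the first digit of the status code. A small sketch of that grouping, with status_class and is_ok as illustrative names only (is_ok mirrors the status_code < 400 rule, it is not the library's own code):

```python
def status_class(status_code):
    """Return the name of the HTTP status class for a numeric status code."""
    classes = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(status_code // 100, "unknown")


def is_ok(status_code):
    """Treat everything below 400 as OK, as suggested above."""
    return status_code < 400
```

Logging status_class(r.status_code) next to each URL makes it easier to see whether a failure is on the client side (4xx, e.g. the page is gone) or on the server side (5xx, possibly temporary).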
Upvotes: 0