Reputation: 21692
I'm trying to get a URL's status via urllib.request.urlopen,
and in some cases it raises urllib.error.URLError: HTTP Error 403: Forbidden,
however I can open the same URL from a browser successfully.
Is it possible to overcome this problem with urllib,
or is it better to use some other library?
import socket
import ssl
import urllib.error
import urllib.request

def urllib_status(url):
    REQUEST_TIMEOUT = 10
    if 'http' not in url:
        url = 'http://' + url
    try:
        response = urllib.request.urlopen(url, timeout=REQUEST_TIMEOUT)
        return response.status
    except urllib.error.URLError as e:
        print('url: ' + url)
        print('urllib.error.URLError:', e)
        return -1
    except ssl.SSLError as e:
        print('url: ' + url)
        print('ssl.SSLError:', e)
        return -1
    except socket.error as e:
        print('url: ' + url)
        print('socket.error:', e)
        return -1
Upvotes: 0
Views: 1464
Reputation: 5564
The problem is likely due to the site not accepting non-browser requests. You can work around it by overriding the User-Agent header in your request (the default is Python-urllib/3.X).
From Python docs:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
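Alternatively, the header can be set on a single Request object instead of installing an opener. This is a minimal sketch; the URL and User-Agent string are placeholders, not taken from the original question:

```python
import urllib.request

# A Request object carries the custom header; passing it to
# urllib.request.urlopen(req) would send it with the GET.
req = urllib.request.Request(
    'http://www.example.com/',
    headers={'User-Agent': 'Mozilla/5.0'},
)
print(req.get_header('User-agent'))  # Mozilla/5.0
```

Note that urllib normalizes header names with str.capitalize(), which is why the stored key reads back as 'User-agent'.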
Or, if you're using requests (the de facto standard HTTP library among Python users):
import requests
requests.get('http://www.example.com/', headers={'User-agent': 'Mozilla/5.0'})
Upvotes: 1
Reputation: 21692
It's simpler using requests:
import requests

def url_status(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0)'
                             ' Gecko/20100101 Firefox/24.0'}
    REQUEST_TIMEOUT = 10
    if 'http' not in url:
        url = 'http://' + url
    try:
        response = requests.get(url, headers=headers, timeout=REQUEST_TIMEOUT)
        if response.status_code != 200:
            print(url)
            print('status:', response.status_code)
        return response.status_code
    except Exception as e:
        print(url)
        print('Error:', e)
        return -1
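As a side note, if you want to check which headers such a call would send without actually hitting the network, requests lets you build and prepare the request separately. A minimal sketch; the URL here is a placeholder:

```python
import requests

# Prepare (but don't send) the request to inspect the outgoing headers.
req = requests.Request(
    'GET', 'http://www.example.com/',
    headers={'User-Agent': 'Mozilla/5.0'},
).prepare()
print(req.headers['User-Agent'])  # Mozilla/5.0
```

PreparedRequest.headers is a case-insensitive dict, so the header can be looked up regardless of capitalization.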
Upvotes: 0