Reputation: 851
I'm trying to open and parse an HTML page. In Python 2.7.8 I have no problem:
import urllib
url = "https://ipdb.at/ip/66.196.116.112"
html = urllib.urlopen(url).read()
and everything is fine. However, when I move to Python 3.4 I get HTTP Error 403 (Forbidden). My code:
import urllib.request
html = urllib.request.urlopen(url) # same URL as before
File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "C:\Python34\lib\urllib\request.py", line 461, in open
response = meth(req, response)
File "C:\Python34\lib\urllib\request.py", line 574, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python34\lib\urllib\request.py", line 499, in error
return self._call_chain(*args)
File "C:\Python34\lib\urllib\request.py", line 433, in _call_chain
result = func(*args)
File "C:\Python34\lib\urllib\request.py", line 582, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
It works for other URLs that don't use https. For example,
url = 'http://www.stopforumspam.com/ipcheck/212.91.188.166'
is OK.
Upvotes: 29
Views: 32048
Reputation: 546
The urllib.request HTTP 403 error occurs because of a server security feature that blocks known bot user agents.
Here are possible solutions, in order of feasibility (easiest to apply first):
1. Add a different user agent, one that is not considered a bot:
from urllib.request import Request, urlopen
web = "https://www.festo.com/de/de"
headers = {
"User-Agent": "XYZ/3.0",
"X-Requested-With": "XMLHttpRequest"
}
request = Request(web, headers=headers)
content = urlopen(request).read()
Optionally, you can set a short timeout for the request if you're running multiple requests consecutively:
content = urlopen(request, timeout=10).read()
2. Add a cookie from your browser after opening the URL manually and accepting all cookies:
from urllib.request import Request, urlopen
web = "https://www.festo.com/de/de"
headers = {
"User-Agent": "XYZ/3.0",
"X-Requested-With": "XMLHttpRequest",
"cookie": "value stored in your webpage"
}
request = Request(web, headers=headers)
content = urlopen(request).read()
If you're using Chrome, you can open the URL, open the inspector (press F12), choose the Application tab, and then pick Cookies under Storage in the tree on the left.
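Alternatively, instead of copying the cookie by hand, the standard library can collect cookies for you. A minimal sketch using http.cookiejar (whether the target site grants access based on cookies alone is an assumption):
import urllib.request
from http.cookiejar import CookieJar
web = "https://www.festo.com/de/de"
# a jar that stores any cookies the server sets
jar = CookieJar()
# an opener that saves received cookies and sends them back on later requests
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", "XYZ/3.0")]
# the first request fills the jar; the second sends the cookies back
opener.open(web)
content = opener.open(web).read()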
3. If the cookie needs to be obtained for several websites, it is wise to make the requests through a Session object from the requests library, since it handles cookies automatically:
import requests
web = "https://www.festo.com/de/de"
headers = {
"User-Agent": "XYZ/3.0",
"X-Requested-With": "XMLHttpRequest"
}
session = requests.Session()
content = session.get(web, headers=headers).text
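Since the session keeps its cookie jar between calls, cookies received from one response are sent automatically with later requests. A brief sketch, reusing the same example URL:
import requests
headers = {"User-Agent": "XYZ/3.0"}
session = requests.Session()
# the first response may set cookies, which the session stores
first = session.get("https://www.festo.com/de/de", headers=headers)
# the stored cookies are sent automatically on this second request
second = session.get("https://www.festo.com/de/de", headers=headers)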
Separately, if SSL certificate verification fails while using urllib, you can disable verification (note that this leaves the connection open to man-in-the-middle attacks):
from urllib.request import Request, urlopen
import ssl
web = "https://www.festo.com/de/de"
headers = {
"User-Agent": "XYZ/3.0",
"X-Requested-With": "XMLHttpRequest"
}
request = Request(web, headers=headers)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
content = urlopen(request, context=ctx).read()
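Disabling verification weakens security. If the third-party certifi package is installed (an assumption; pip install certifi), a safer fix is to point the SSL context at its CA bundle:
from urllib.request import Request, urlopen
import ssl
import certifi  # third-party package, assumed installed
web = "https://www.festo.com/de/de"
request = Request(web, headers={"User-Agent": "XYZ/3.0"})
# verify against certifi's up-to-date CA bundle instead of disabling checks
ctx = ssl.create_default_context(cafile=certifi.where())
content = urlopen(request, context=ctx).read()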
Credits to the following: Question 1, Question 2, SSL-Certificate.
Upvotes: 1
Reputation:
Here are some notes I gathered on urllib when I was studying Python 3; I kept them in case they come in handy or help someone else out.
Import urllib.request and urllib.parse:
import urllib.request as urlRequest
import urllib.parse as urlParse
A plain GET request:
url = "http://www.example.net"
# open the url
x = urlRequest.urlopen(url)
# get the source code
sourceCode = x.read()
url = "https://www.example.com"
values = {"q": "python if"}
# encode values for the url
values = urlParse.urlencode(values)
# encode the values in UTF-8 format
values = values.encode("UTF-8")
# create the request (passing data makes this a POST)
targetUrl = urlRequest.Request(url, values)
# open the url
x = urlRequest.urlopen(targetUrl)
# get the source code
sourceCode = x.read()
A POST request pretending to be a browser (useful for getting around 403 Forbidden responses):
url = "https://www.example.com"
values = {"q": "python urllib"}
# pretend to be a chrome 47 browser on a windows 10 machine
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
# encode values for the url
values = urlParse.urlencode(values)
# encode the values in UTF-8 format
values = values.encode("UTF-8")
# create the request (passing data makes this a POST)
targetUrl = urlRequest.Request(url = url, data = values, headers = headers)
# open the url
x = urlRequest.urlopen(targetUrl)
# get the source code
sourceCode = x.read()
A GET request pretending to be a browser (useful for getting around 403 Forbidden responses):
url = "https://www.example.com"
# pretend to be a chrome 47 browser on a windows 10 machine
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
req = urlRequest.Request(url, headers = headers)
# open the url
x = urlRequest.urlopen(req)
# get the source code
sourceCode = x.read()
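One more note on these snippets: read() returns bytes. To get text, you can decode with the charset the server declares; falling back to UTF-8 when none is declared is an assumption that fits most sites:
import urllib.request as urlRequest
url = "http://www.example.net"
x = urlRequest.urlopen(url)
sourceCode = x.read()  # bytes
# use the charset from the Content-Type header, defaulting to UTF-8
charset = x.headers.get_content_charset() or "UTF-8"
text = sourceCode.decode(charset)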
Upvotes: 3
Reputation: 369134
It seems like the site does not like the default user agent of Python 3.x's urllib.
Specifying a User-Agent header will solve your problem:
import urllib.request
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
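If you fetch many URLs, you can also install an opener once so that every urlopen call sends the header. A minimal sketch (whether 'Mozilla/5.0' satisfies your particular site is an assumption):
import urllib.request
url = "https://ipdb.at/ip/66.196.116.112"
# a global opener whose User-Agent applies to all later urlopen calls
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
html = urllib.request.urlopen(url).read()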
NOTE: Python 2.x's urllib also receives a 403 status, but unlike Python 2.x's urllib2 and Python 3.x's urllib, it does not raise an exception. You can confirm that with the following code:
print(urllib.urlopen(url).getcode()) # => 403
Upvotes: 45