Reputation: 1140
I have been using Amazon's Product Advertising API to generate URLs that contain prices for a given book. One URL that I have generated is the following:
When I click on the link or paste the link on the address bar, the web page loads fine. However, when I execute the following code I get an error:
url = "http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327"
html_contents = urllib2.urlopen(url)
The error is urllib2.HTTPError: HTTP Error 503: Service Unavailable. First of all, I don't understand why I even get this error since the web page successfully loads.
Another weird behavior I have noticed is that the following code sometimes does and sometimes does not give the stated error:
html_contents = urllib2.urlopen("http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327")
I am totally lost on how this behavior occurs. Is there any fix or work around to this? My goal is to read the html contents of the url.
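For reference, wrapping the call in a try/except at least shows what the 503 response contains; this is just a diagnostic sketch with the same URL as above, and it does not fix the error:

import urllib2

url = "http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327"

try:
    html_contents = urllib2.urlopen(url).read()
except urllib2.HTTPError as e:
    # HTTPError doubles as a file-like response, so the 503 body is readable
    print(e.code)
    print(e.read())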
EDIT
I don't know why Stack Overflow is rewriting the Amazon link in my code above to rads.stackoverflow. Anyway, ignore the rads.stackoverflow link and use the link between the quotes in my code above.
Upvotes: 23
Views: 61386
Reputation: 8970
Ben's answer is the accepted answer to the OP's question, and I assume it is correct (I haven't validated it).
However, since this question is the first hit when googling for 'python 503 urllib2', and that answer does not solve the problem I spent the last three hours investigating, I'll offer an alternative answer.
If you're unlucky enough to be dealing with Python 2.6 (in 2023!), you're probably really out of luck.
Support for TLS SNI was first introduced in Python 3 (in 3.2) and backported to Python 2.7 (in 2.7.9; see https://stackoverflow.com/a/27717544/8280541), but it was never added to 2.6 or earlier. Nowadays, many if not most HTTPS servers depend on SNI.
Long story short, SNI allows a single HTTPS server to serve several different sites, potentially with different certificates. When a client connects via HTTPS, one of the first things it does is send the server the hostname it wants to reach (the Server Name Indication), even before any certificates are exchanged. With that information, the server can return the content of the site the client asked for, along with the matching certificate.
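From Python itself, one rough way to see whether your interpreter's ssl module even knows about SNI is the ssl.HAS_SNI flag; it was added alongside SNI support (2.7.9/3.2), so on 2.6 the attribute is simply missing:

import ssl

# HAS_SNI was added together with SNI support; on old builds the attribute
# doesn't exist at all, which is itself the answer.
print(getattr(ssl, "HAS_SNI", False))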
To confirm you're in this situation:

- The URL works when fetched with curl or some other tool.
- openssl s_client -connect my.server.url:443 < /dev/null | grep subject ; take note of the server name it shows (it should be the one you're expecting).
- openssl s_client -noservername -connect my.server.url:443 < /dev/null | grep subject ; the server name changed.

If that's the case, Python 2.6 is not for you. Time to upgrade, perhaps? If you still insist on using ancient and unsupported software, one alternative is to run curl through subprocess.Popen to get what you want.
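A rough sketch of that last-resort route (assuming curl is installed and on the PATH; my.server.url is the same placeholder host as above):

import subprocess

# Sketch only: let curl, which handles SNI itself, do the HTTPS work.
proc = subprocess.Popen(
    ["curl", "-sS", "https://my.server.url/"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
body, err = proc.communicate()
if proc.returncode != 0:
    raise RuntimeError("curl failed: %s" % err)
print(body)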
Upvotes: 0
Reputation: 2280
Amazon is rejecting the default User-Agent for urllib2. One workaround is to use the requests module:
import requests
page = requests.get("http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327")
html_contents = page.text
If you insist on using urllib2, this is how a header can be faked to do it:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open('http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327')
html_contents = response.read()
Don't worry about Stack Overflow editing the URL. They explain that they are doing this here.
Upvotes: 28
Reputation: 6767
It's because Amazon doesn't allow automated access to its data, so it rejects requests that don't appear to come from a proper browser. If you look at the content of the 503 response, it says:
To discuss automated access to Amazon data please contact [email protected]. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
This is because the User-Agent for Python's urllib is so obviously not a browser. You could always fake the User-Agent, but that's not really good (or moral) practice.

As a side note, as mentioned in another answer, the requests library is really good for HTTP access in Python.
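For instance, a quick sketch with requests, which does not raise on a 503 by default, so you can read the rejection message for yourself; the URL is the one from the question:

import requests

# If the request is rejected, status_code is 503 and the message quoted above
# appears in the body; otherwise you get the page itself.
r = requests.get("http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327")
print(r.status_code)
print(r.text[:500])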
Upvotes: 15