Reputation: 497
I just got started with the urllib module. I'm trying to scrape products from supermarkets, and there's one website that always seems to respond with HTTP Error 429: Too Many Requests. I already did a bit of research on Stack Overflow and no one seems to have the same problem. My code is as simple as it can get:
>>> import urllib.request
>>> resp = urllib.request.urlopen("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
resp = urllib.request.urlopen("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 640, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 568, in error
return self._call_chain(*args)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 648, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests
I've also tried to modify the User-Agent header as this answer suggests, but the result is still the same.
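For reference, the User-Agent change was roughly along these lines (the exact Mozilla string is just a placeholder browser-like value):
>>> import urllib.request
>>> url = "https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean"
>>> req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
>>> resp = urllib.request.urlopen(req)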
Can someone explain which default settings inside the urllib module may cause the problem? Or is it because the website blocks bots? Other product pages of the website don't work either.
Upvotes: 0
Views: 1647
Reputation: 9185
A 429 is the server asking you to stop. Basically, the web server thinks you are trying to spam or scrape it, and it doesn't like that. Generally you should honor the server: if the 429 response comes with a Retry-After header, follow it and wait that long before trying again.
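For example, a minimal sketch of honoring that header with urllib could look like this (the retry count and fallback delay below are arbitrary choices, not anything the server requires):

import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_attempts=3, fallback_delay=60):
    # Retry on 429, waiting for the Retry-After value (or a fallback) between attempts.
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(url)
        except urllib.error.HTTPError as e:
            if e.code != 429 or attempt == max_attempts - 1:
                raise
            retry_after = e.headers.get("Retry-After")
            # Retry-After is usually a number of seconds; fall back if it's missing or an HTTP date.
            delay = int(retry_after) if retry_after and retry_after.isdigit() else fallback_delay
            time.sleep(delay)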
If you feel the server is blocking you wrongly, you can first make sure your request is *similar* to the request a real user's browser would generate, which means sending a User-Agent and all the other headers a regular browser would include. If the server keeps returning 429 despite that, it has most probably blocked your IP, either temporarily or permanently, and in that case you would have to look into scraping through multiple IPs.
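As a sketch, a browser-like set of headers can be attached to an opener like this (the header values are only illustrative examples of what a desktop browser might send):

import urllib.request

# Replace the default "Python-urllib/3.x" headers with browser-like ones.
opener = urllib.request.build_opener()
opener.addheaders = [
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-AU,en;q=0.9"),
]
resp = opener.open("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")

If the block is tied to your IP rather than your headers, this alone won't change the response.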
Upvotes: 1