Reputation: 7285
import urllib2
# Spoof a browser User-Agent so Amazon doesn't reject the request outright
req = urllib2.Request('http://www.amazon.com/Sweet-Virgin-Organic-Coconut-13-5oz/dp/B00Q5CIL4Y', headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
print len(html)
That's the smallest example I can make. If you run it, roughly 1 in 5 times the response is only 5769 bytes long; the rest of the time it's a normal, usable response.
What's up with this?
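For reference, a rough loop like this is how I counted the short responses; the 10000-byte cutoff is just an arbitrary threshold well below the size of a normal product page.
import urllib2

url = 'http://www.amazon.com/Sweet-Virgin-Organic-Coconut-13-5oz/dp/B00Q5CIL4Y'
short = 0
for i in range(20):
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(req).read()
    if len(html) < 10000:  # well below the size of a full product page
        short += 1
    print i, len(html)
print 'short responses:', short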
edit:
incorrect response: http://pastebin.com/d7zdy0uv
Upvotes: 0
Views: 60
Reputation: 279325
Given the content of the short responses, this becomes much easier to answer. Amazon suspects you're doing automated scraping of its site, and has served you a CAPTCHA that, if you were a human using a browser, you could solve.
I'm slightly surprised it only hits you on one in five requests, though, rather than either always or never.
As it says in Amazon's response, consider using their APIs instead.
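If you do keep scraping, you can at least detect when you've been handed the robot-check page instead of a product page. A minimal sketch; the 'captcha' marker is an assumption on my part, so match on whatever your short response (like the pastebin above) actually contains:
import urllib2

url = 'http://www.amazon.com/Sweet-Virgin-Organic-Coconut-13-5oz/dp/B00Q5CIL4Y'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()

# Heuristic check: inspect your own short response and adjust the marker text
if 'captcha' in html.lower():
    print 'Blocked: Amazon served its robot-check page'
else:
    print 'Got a real product page, %d bytes' % len(html)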
Upvotes: 2
Reputation: 2480
It looks like it must be an issue on your side; I've run it ~50 times and I get a response of ~490,000 bytes every time.
You are being rate limited.
Check the length of the data: when you detect a short response, wait a while before retrying until you are no longer rate limited, as in the sketch below. (You'll have to figure out what request rate is sustainable.)
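Something along these lines, for example; the length threshold and delays are guesses you'd have to tune against whatever rate Amazon tolerates:
import time
import urllib2

def fetch(url, min_length=10000, max_tries=5):
    # Retry with an increasing delay whenever the response looks like the
    # short rate-limited page; the threshold and delays are rough guesses.
    delay = 5
    for attempt in range(max_tries):
        req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urllib2.urlopen(req).read()
        if len(html) >= min_length:
            return html
        time.sleep(delay)
        delay *= 2  # back off harder each time we get the short page
    return None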
Upvotes: 0