Ryan Weinstein

Reputation: 7285

urllib2 request returns a different page about 1 in 5 times

import urllib2

req = urllib2.Request('http://www.amazon.com/Sweet-Virgin-Organic-Coconut-13-5oz/dp/B00Q5CIL4Y', headers={ 'User-Agent': 'Mozilla/5.0' })

html = urllib2.urlopen(req).read()
print len(html)

That's the smallest example I can make. If you run it, about 1 in 5 times the length of the response will be 5769; the rest of the time it is a normal, usable response.

What's up with this?

Edit: the incorrect (short) response: http://pastebin.com/d7zdy0uv

Upvotes: 0

Views: 60

Answers (2)

Steve Jessop

Reputation: 279325

Given the content of the short responses, this becomes much easier to answer. Amazon suspects you're doing automated scraping of its site, and has served you a CAPTCHA that, if you were a human using a browser, you could solve.

I'm slightly surprised it only hits you one in five requests, though, rather than either always or never.

As it says in Amazon's response, consider using their APIs instead.
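If you do keep fetching pages directly, it helps to at least recognise the CAPTCHA page instead of treating it as product HTML. Here is a minimal sketch of such a check. The names `looks_blocked`, `MIN_NORMAL_LENGTH` and `BLOCK_MARKERS` are made up for illustration, the 20000-byte cutoff is just a guess that sits between the 5769-byte response from the question and a full page, and the `captcha` marker is an assumption about what the short page contains, not something Amazon documents.

```python
# Heuristic only: the cutoff and the marker are assumptions chosen for
# illustration, not values documented by Amazon.
MIN_NORMAL_LENGTH = 20000        # hypothetical cutoff between short and normal pages
BLOCK_MARKERS = (b"captcha",)    # hypothetical marker text, matched case-insensitively


def looks_blocked(html, min_length=MIN_NORMAL_LENGTH, markers=BLOCK_MARKERS):
    """Return True if the response body is suspiciously short or contains a marker."""
    if len(html) < min_length:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in markers)
```

You would call `looks_blocked(html)` on the bytes returned by `read()` and skip parsing when it returns True. Expect false positives if a normal page happens to mention the marker text, so tune both values against responses you actually receive.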

Upvotes: 2

John

Reputation: 2480

It looks like the issue is on your side: I've run it ~50 times and I get a length of about 490000 every time.

You are being rate limited.

Check the length of the data. When you detect a short response, wait for a while until you are no longer rate limited. (You'll have to figure out what request rate is sustainable.)
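A rough sketch of that check-and-wait loop is below. It is written for Python 3, where `urllib.request` replaces the question's `urllib2`. The function names, the 20000-byte cutoff, the 5-second starting delay and the doubling of the delay are all assumptions made for illustration; the sustainable rate is something you would have to measure yourself.

```python
import time
import urllib.request

MIN_LENGTH = 20000  # assumed cutoff: the short responses in the question were 5769 bytes


def fetch(url):
    """Fetch the page body with the same User-Agent header the question uses."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def fetch_with_backoff(url, fetch=fetch, sleep=time.sleep,
                       max_attempts=5, initial_delay=5.0):
    """Retry while the response is short, doubling the wait after each short response."""
    delay = initial_delay
    for attempt in range(max_attempts):
        html = fetch(url)
        if len(html) >= MIN_LENGTH:
            return html
        if attempt < max_attempts - 1:
            sleep(delay)   # wait before trying again
            delay *= 2     # back off further each time (an assumed policy)
    raise RuntimeError("still getting short responses after %d attempts" % max_attempts)
```

Calling `fetch_with_backoff(url)` with the URL from the question returns the body once a full-length response arrives, or raises after `max_attempts` short ones. The `fetch` and `sleep` parameters exist only so the loop can be exercised without touching the network.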

Upvotes: 0
