Reputation: 13
I am scraping hotel room data from expedia.co.uk using Python 2.7 and mechanize (on a Mac), looping through a list of about 1000 URLs (200 hotels and 5 different periods).
When I ran the code, it worked fine for the first 200 and then gave me the following error:
httperror_seek_wrapper: Gateway Time-out
Since then, it always gives me this error for anything I try to load from the Expedia website, although opening the same URL in Internet Explorer/Chrome works fine.
Here's some example code:
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_refresh(False)
url = 'https://www.expedia.co.uk/Massena-Square-Hotels-Hotel-Aston-La-Scala.h53477.Hotel-Information?&rm1=a1&chkout=02/12/2016&chkin=01/12/2016'
r = br.open(url, timeout = 2.0)
soup = BeautifulSoup(r, 'lxml')
And this is the traceback:
Traceback (most recent call last):
File "", line 5, in r = br.open(url, timeout = 2.0)
File "build/bdist.macosx-10.5-x86_64/egg/mechanize/_mechanize.py", line 203, in open return self._mech_open(url, data, timeout=timeout)
File "build/bdist.macosx-10.5-x86_64/egg/mechanize/_mechanize.py", line 255, in _mech_open raise response
httperror_seek_wrapper: Gateway Time-out
I tried different timeouts and different IP addresses; same error. Is there any way around this?
Upvotes: 1
Views: 723
Reputation: 48599
I can get rid of the timeout error using:
br.addheaders.append(
('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9')
)
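For example, here's a minimal sketch combining that header with the code from the question (the URL and timeout are copied from the question; the User-Agent string is simply the one that worked for me, and any current browser string should do):

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_refresh(False)
# Identify the request as coming from a browser instead of 'Python-urllib/2.7':
br.addheaders.append(
    ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9')
)
url = 'https://www.expedia.co.uk/Massena-Square-Hotels-Hotel-Aston-La-Scala.h53477.Hotel-Information?&rm1=a1&chkout=02/12/2016&chkin=01/12/2016'
r = br.open(url, timeout = 2.0)
soup = BeautifulSoup(r, 'lxml')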
If you print out the mechanize headers for a simple request to a random website, you'll see something like this:
import mechanize
br = mechanize.Browser()
br.set_handle_refresh(False)
url = 'http://www.example.com'
r = br.open(url, timeout = 2.0)
request = br.request
print(request.header_items())
--output:--
[('Host', 'www.example.com'), ('User-agent', 'Python-urllib/2.7')]
The default mechanize headers identify the request as coming from a computer program, 'Python-urllib/2.7', which this website evidently rejects.
If you use your browser's developer tools, you can examine the request that the browser sends to your URL. Under the Network tab, look at the request headers; you'll see headers that differ from the default mechanize headers. In your mechanize request, you just need to duplicate the headers that your browser sends. It turns out that if you identify your request as coming from a browser rather than a Python program, the request will succeed without adding any of the other headers that the browser sends.
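That said, if a site ever demands more than the User-Agent, a sketch of copying additional browser headers looks like this (the values below are illustrative only; substitute whatever your own developer tools show, and note that assigning to addheaders replaces the 'Python-urllib/2.7' default entirely):

import mechanize

br = mechanize.Browser()
br.set_handle_refresh(False)
# Replace the default header list so 'Python-urllib/2.7' is never sent.
# These values are examples; copy the ones your own browser sends.
br.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-GB,en;q=0.5'),
]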
Upvotes: 1