P Mav

Reputation: 13

Python Mechanize: Gateway Time-out when opening url, but url opens fine in internet browser

I am scraping hotel room data from expedia.co.uk using Python 2.7 and mechanize (on a Mac), looping through a list of about 1000 URLs (200 hotels and 5 different periods).

When I ran the code, it worked fine for the first 200 and then gave me the following error:

httperror_seek_wrapper: Gateway Time-out

Since then, it has always given me this error for anything I try to load from the Expedia website, although opening the same URL in Internet Explorer or Chrome works fine.

Here's some example code:

import mechanize
from bs4 import BeautifulSoup
br = mechanize.Browser()
br.set_handle_refresh(False)
url = 'https://www.expedia.co.uk/Massena-Square-Hotels-Hotel-Aston-La-Scala.h53477.Hotel-Information?&rm1=a1&chkout=02/12/2016&chkin=01/12/2016'
r = br.open(url, timeout = 2.0)
soup = BeautifulSoup(r,'lxml')

And this is the traceback:

Traceback (most recent call last):
  File "", line 5, in 
    r = br.open(url, timeout = 2.0)
  File "build/bdist.macosx-10.5-x86_64/egg/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "build/bdist.macosx-10.5-x86_64/egg/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
httperror_seek_wrapper: Gateway Time-out

I tried different timeouts and different IP addresses, but I get the same error every time. Is there any way around this?

Upvotes: 1

Views: 723

Answers (1)

7stud

Reputation: 48599

I can get rid of the timeout error by adding a User-Agent header:

br.addheaders.append(
    ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9')
)

If you print out the headers mechanize sends for a simple request to a random website, you'll see something like this:

import mechanize

br = mechanize.Browser()
br.set_handle_refresh(False)

url = 'http://www.example.com'
r = br.open(url, timeout = 2.0)

request = br.request
print(request.header_items())

--output:--
[('Host', 'www.example.com'), ('User-agent', 'Python-urllib/2.7')]

The default mechanize headers identify the request as being sent by a computer program, 'Python-urllib/2.7', which the website does not approve of.

If you use your browser's developer tools, you can examine the request that the browser sends to your URL: under the Network tab, look at the request headers, and you'll see headers that differ from the mechanize defaults. In your mechanize request, you just need to duplicate the headers that your browser sends. It turns out that identifying your request as coming from a browser rather than a Python program is enough: the request then succeeds without adding any of the other headers that the browser sends.
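As a sketch of that last point (the helper name and the extra header values here are hypothetical, not part of the original answer), you could keep the browser-style headers in a dict and convert them to the (name, value) tuples that mechanize's addheaders list expects:

```python
# Hypothetical helper: turn a dict of browser-style headers into the
# (name, value) tuples that mechanize.Browser.addheaders holds.
# In this case only User-Agent turned out to be required; the other
# entries are examples of headers a real browser also sends.

BROWSER_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                   'AppleWebKit/601.3.9 (KHTML, like Gecko) '
                   'Version/9.0.2 Safari/601.3.9'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
}

def as_addheaders(headers):
    """Convert a header dict to a sorted list of (name, value) tuples."""
    return sorted(headers.items())

headers = as_addheaders(BROWSER_HEADERS)
print(headers[-1][0])  # -> User-Agent (last name alphabetically)
```

With mechanize you would then set br.addheaders = as_addheaders(BROWSER_HEADERS) before calling br.open, instead of appending headers one at a time.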

Upvotes: 1
