foosion

Reputation: 7898

python lxml.html.parse not reading url

Why does html.parse(url) fail, when fetching the page with requests and then calling html.fromstring works, and html.parse(url2) also works? lxml 3.4.2

Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import requests
>>> from lxml import html
>>> url = 'http://www.oddschecker.com'
>>> page = requests.get(url).content
>>> tree = html.fromstring(page)
>>> html.parse(url)

Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    html.parse(url)
  File "C:\program files\Python27\lib\site-packages\lxml\html\__init__.py", line 788, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105213)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100163)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94286)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95722)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94754)
IOError: Error reading file 'http://www.oddschecker.com': failed to load HTTP resource
>>> url2 = 'http://www.google.com'
>>> html.parse(url2)
<lxml.etree._ElementTree object at 0x00000000033BAF88>

Upvotes: 2

Views: 1042

Answers (2)

alecxe

Reputation: 473903

Adding some clarification to @michael_stackof's answer: this particular URL returns a 403 Forbidden status code if no User-Agent header is supplied.

According to lxml's source code, it uses urllib2.urlopen() without supplying a User-Agent header. That results in the 403, which in turn produces the failed to load HTTP resource error.

On the other hand, requests provides a default User-Agent header if not explicitly passed:

>>> requests.get(url).request.headers['User-Agent']
'python-requests/2.3.0 CPython/2.7.6 Darwin/14.1.0'

To prove the point, set the User-Agent header to None and compare:

>>> requests.get(url).status_code
200
>>> requests.get(url, headers={'User-Agent': None}).status_code
403
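One workaround, then, is to do the fetching yourself and hand lxml a file-like object, so its urllib-based opener is never used. A minimal sketch (in practice the bytes would come from requests.get(url).content, which carries the default User-Agent and so avoids the 403; a literal page is used here for illustration):

```python
from io import BytesIO
from lxml import html

# In real use: page = requests.get(url).content
page = b'<html><body><h1>odds</h1></body></html>'

# html.parse accepts a file-like object, bypassing lxml's own URL opener
tree = html.parse(BytesIO(page))
print(tree.getroot().tag)  # -> html
```

This gives you an _ElementTree (as html.parse(url2) did for Google) rather than the bare element returned by html.fromstring, so code that relies on tree-level methods like getroot() keeps working.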

Upvotes: 2

michael_stackof

Reputation: 223

When the HTTP status is not 200, html.parse gives up.

(Screenshot showing the return status of http://www.oddschecker.com.)

Upvotes: 2
