Reputation: 7898
Why does html.parse(url) fail, when fetching the URL with requests and then calling html.fromstring works, and html.parse(url2) works? lxml 3.4.2
Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import requests
>>> from lxml import html
>>> url = 'http://www.oddschecker.com'
>>> page = requests.get(url).content
>>> tree = html.fromstring(page)
>>> html.parse(url)
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
html.parse(url)
File "C:\program files\Python27\lib\site-packages\lxml\html\__init__.py", line 788, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105213)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100163)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94286)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95722)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94754)
IOError: Error reading file 'http://www.oddschecker.com': failed to load HTTP resource
>>> url2 = 'http://www.google.com'
>>> html.parse(url2)
<lxml.etree._ElementTree object at 0x00000000033BAF88>
Upvotes: 2
Views: 1042
Reputation: 473903
Adding some clarification to @michael_stackof's answer. This particular URL returns a 403 Forbidden status code if the User-Agent header is not supplied.
According to lxml's source code, it uses urllib2.urlopen() without supplying a User-Agent header, which results in a 403, which in turn produces the "failed to load HTTP resource" error.
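You can reproduce the same failure outside of lxml by calling urllib2.urlopen() directly. A minimal sketch (assuming Python 2, as in the question, and that the server still rejects non-browser requests):

import urllib2

url = 'http://www.oddschecker.com'

try:
    # The default urllib2 request carries no browser-like User-Agent,
    # so this server rejects it.
    urllib2.urlopen(url)
except urllib2.HTTPError as e:
    print e.code  # expected: 403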
On the other hand, requests supplies a default User-Agent header if one is not explicitly passed:
>>> requests.get(url).request.headers['User-Agent']
'python-requests/2.3.0 CPython/2.7.6 Darwin/14.1.0'
To prove the point, set the User-Agent header to None and compare:
>>> requests.get(url).status_code
200
>>> requests.get(url, headers={'User-Agent': None}).status_code
403
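A possible workaround (a sketch, not part of the original answer): html.parse() also accepts a file-like object, so you can open the URL yourself with an explicit User-Agent and hand lxml the response. The User-Agent value below just mirrors what requests sends by default, which this server accepts:

import urllib2
from lxml import html

url = 'http://www.oddschecker.com'

# Build the request with an explicit User-Agent so the server does
# not answer 403; urllib2.urlopen() returns a file-like object that
# html.parse() can consume directly.
req = urllib2.Request(url, headers={'User-Agent': 'python-requests/2.3.0'})
tree = html.parse(urllib2.urlopen(req))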
Upvotes: 2
Reputation: 223
When the HTTP status is not 200, html.parse fails. Check the return status of http://www.oddschecker.com.
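A concrete version of that check (a sketch using requests, as in the answer above):

import requests
from lxml import html

url = 'http://www.oddschecker.com'
resp = requests.get(url)

# Only hand the body to lxml when the server actually returned 200.
if resp.status_code == 200:
    tree = html.fromstring(resp.content)
else:
    print 'got HTTP %d, not parsing' % resp.status_code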
Upvotes: 2