Reputation: 8089
I am a novice with lxml. I want to download a web page and extract the data I am interested in from it. My code is:
import urllib2
from lxml import etree
url = "http://www.example.com/"
html = urllib2.urlopen(url)
root = etree.parse(html) # the problem is here
Can anyone explain why this is wrong?
The error is:
Traceback (most recent call last):
File "yatego.py", line 10, in <module>
root = etree.parse(html)
File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79703)
File "parser.pxi", line 1580, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:80012)
File "parser.pxi", line 1463, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:78908)
File "parser.pxi", line 1019, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:75905)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Entity 'mdash' not defined, line 4, column 21
This code:
url = "http://www.example.com/"
res = requests.get(url)
doc = lxml.html.parse(res.content)
gives this error:
File "yatego.py", line 11, in <module>
doc = lxml.html.parse(res.content)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 692, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>IANA — Example domains</title>
This code:
doc = lxml.html.parse(url)
works fine
So where is the problem?
Upvotes: 10
Views: 23001
Reputation: 7380
You should use html.read()
to begin with: html is a file-like response object, not a string. Also, you should really check whether the URL downloaded properly, as this is by no means assured.
Update: use lxml.html.parse(filename_or_url)
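The advice about checking that the download succeeded can be sketched roughly as follows. This is only an illustration, written for Python 3 with requests: the function name fetch_document and its session parameter are hypothetical and not part of lxml or requests.

```python
import lxml.html
import requests


def fetch_document(url, session=None):
    """Download url and return the parsed root element.

    fetch_document and session are hypothetical names used for this sketch.
    """
    http = session or requests
    res = http.get(url, timeout=10)
    # Raises requests.HTTPError for 4xx/5xx responses, so a failed
    # download is never handed to the parser.
    res.raise_for_status()
    # res.content holds the raw bytes of the page, so fromstring() is used.
    return lxml.html.fromstring(res.content)
```

Calling fetch_document("http://www.example.com/") would then either return the root element of the page or raise an exception describing the HTTP failure.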
Upvotes: 0
Reputation: 52371
The key here is the exception:
IOError: Error reading file '<!DOCTYPE html PUBLIC ...
You're passing the content of a file to a function that expects a path to a file.
That is also why doc = lxml.html.parse(url)
works: a URL "is a" file path.
Does the following work better?
doc = lxml.html.fromstring(res.content)
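To make the difference concrete, here is a small sketch (Python 3) that needs no network access. The content variable is a made-up stand-in for res.content, loosely modelled on the page quoted in the error message.

```python
import io

import lxml.html

# Hypothetical bytes standing in for res.content.
content = (
    b"<html><head><title>IANA &mdash; Example domains</title></head>"
    b"<body><p>Hello</p></body></html>"
)

# fromstring() takes the markup itself and returns the root element.
root = lxml.html.fromstring(content)
title = root.findtext(".//title")

# parse() takes a filename, a URL, or a file-like object and returns
# an ElementTree, so in-memory markup has to be wrapped first.
tree = lxml.html.parse(io.BytesIO(content))
same_title = tree.getroot().findtext(".//title")
```

Both calls end up with the same document. The only difference is what they accept as input and whether you get back an element or a tree.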
Upvotes: 11
Reputation: 53829
You should use lxml.html to parse HTML instead of lxml.etree.
You can also open the URL directly with lxml:
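As a rough illustration of why this matters (Python 3, with a made-up markup string): the default parser in lxml.etree is a strict XML parser that does not know HTML entities such as &mdash;, whereas the HTML parser does.

```python
from lxml import etree

# Hypothetical markup containing an HTML-only entity.
markup = (
    b"<html><head><title>IANA &mdash; Example domains</title></head>"
    b"<body/></html>"
)

# The default XML parser rejects the undefined entity.
try:
    etree.fromstring(markup)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = str(exc)

# The HTML parser resolves HTML entities and tolerates sloppy markup.
root = etree.fromstring(markup, etree.HTMLParser())
title = root.findtext(".//title")
```

Here xml_error holds the same kind of "Entity 'mdash' not defined" message as in the question, while the second call parses without complaint.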
doc = lxml.html.parse(url)
Sometimes lxml will have trouble dealing with HTTP's quirks, in which case you'd need to use a more robust solution to fetch pages, like requests:
res = requests.get(url)
doc = lxml.html.fromstring(res.content)
Upvotes: 6