Reputation: 9402
I'm trying to use Python to parse HTML (although strictly speaking, the server claims it's xhtml) and every parser I have tried (ElementTree, minidom, and lxml) all fail. When I go to look at where the problem is, it's inside a script tag:
<script type="text/javascript">
... // some javascript code
if (condition1 && condition2) { // croaks on this line
I see what the problem is, the ampersand should be quoted. The problem is, this is inside a javascript script tag, so it cannot be quoted, because that would break the code.
What's going on here? How is inline javascript able to break my parse, and what can I do about it?
Update: per request, here is the code used with lxml.
>>> from lxml import etree
>>> tree=etree.parse("http://192.168.1.185/site.html")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95050)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 77, column 22
The lxml manual starts Chapter 9 by stating "lxml provides a very simple and powerful API for parsing XML and HTML" so I would expect to not see that exception.
Upvotes: 3
Views: 1502
Reputation: 22433
There are a lot of really crappy ways for HTML parsing to break. Bad HTML is ubiquitous, and both script
sections and various templating languages throw monkey wrenches into the works.
But, you also seem to be using XML-oriented parsers for the job, which are stricter and thus much, much more likely to break if not presented with exactly-right, totally valid input. Which most HTML--including most XHTML--manifestly is not.
So, use a parser designed to overlook some of the HTML gotchas:
import lxml.html
d = lxml.html.parse(URL)
That should take you off to the races.
Upvotes: 5