Reputation: 1300
I have a problem with the XPath function of pythons lxml. A minimal example is the following python code:
from lxml import html, etree
text = """
<p class="goal">
<strong>Goal</strong> <br />
<ul><li>test</li></ul>
</p>
"""
tree = html.fromstring(text)
thesis_goal = tree.xpath('//p[@class="goal"]')[0]
print etree.tostring(thesis_goal)
Running the code produces
<p class="goal">
<strong>Goal</strong> <br/>
</p>
As you can see, the entire <ul>
block is lost. This also means that it is not possible to address the <ul>
with an XPath along the lines of //p[@class="goal"]/ul
, as the <ul>
is not counted as a child of the <p>
.
Is this a bug or a feature of lxml, and if it is the latter, how can I get access to the entire contents of the <p>
? The thing is embedded in a larger website, and it is not guaranteed that there will even be a <ul>
tag (there may be another <p>
inside, or anything else, for that matter).
Update: Updated title after answer was received to make finding this question easier for people with the same problem.
Upvotes: 1
Views: 565
Reputation: 1419
@unutbu has correct anwser. Your HTML is not valid and html parser will produce unexpected results. As it's said in the lxml docs,
The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.
Depending on what you are trying to achieve, you can fallback to xml parser
# Changing html to etree here will produce behaviour you expect
tree = etree.fromstring(text)
or move to more advanced website parsing package such as BeautifulSoup4, for example
Upvotes: 2
Reputation: 880717
ul
elements (or more generally flow content) are not allowed inside p
elements (which can only contain phrasing content). Therefore lxml.html
parses text
as
In [45]: print(html.tostring(tree))
<div><p class="goal">
<strong>Goal</strong> <br>
</p><ul><li>test</li></ul>
</div>
The ul
follows the p
element. So you could find the ul
element using the XPath
In [47]: print(html.tostring(tree.xpath('//p[@class="goal"]/following::ul')[0]))
<ul><li>test</li></ul>
Upvotes: 2