Python lxml's XPath not finding <ul> in <p> tags

Question

I have a problem with the XPath function of Python's lxml. A minimal example is the following Python code:

from lxml import html, etree

text = """
      <p class="goal">
            Goal
            <ul><li>test</li></ul>
        </p>
"""

tree = html.fromstring(text)
thesis_goal = tree.xpath('//p[@class="goal"]')[0]
print(etree.tostring(thesis_goal, encoding="unicode"))

Running the code produces


<p class="goal">
            Goal
            </p>

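As you can see, the entire <ul> block is missing from the output. It also cannot be selected with an XPath such as

//p[@class="goal"]/ul

because the <ul> is not treated as a child of the <p>.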

Is this a bug or a feature of lxml, and if it is the latter, how can I get access to the entire contents of the <p>? The thing is embedded in a larger website, and it is not guaranteed that there will even be a <ul> inside (or anything else, for that matter).

Update: Updated title after answer was received to make finding this question easier for people with the same problem.

unutbu · Accepted Answer

ul elements (or more generally flow content) are not allowed inside p elements (which can only contain phrasing content). Therefore lxml.html parses text as

In [45]: print(html.tostring(tree))

<div><p class="goal">
            Goal
            </p><ul><li>test</li></ul>
</div>

The ul follows the p element. So you could find the ul element using the XPath

In [47]: print(html.tostring(tree.xpath('//p[@class="goal"]/following::ul')[0]))
<ul><li>test</li></ul>
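To check this behaviour yourself, here is a minimal Python 3 sketch. It assumes a reduced version of the markup from the question (the real page may have more tags inside the <p>), and the variable names are only illustrative. It shows that the child XPath matches nothing, that the <ul> can be reached as a following sibling of the <p>, and that `itersiblings()` lists whatever the parser placed after the <p>, which may help when you do not know in advance that the element will be a <ul>.

```python
from lxml import html

# Reduced markup, assumed for illustration: a <ul> written inside a <p>.
text = """
<p class="goal">
    Goal
    <ul><li>test</li></ul>
</p>
"""

tree = html.fromstring(text)
goal = tree.xpath('//p[@class="goal"]')[0]

# The parser closes the <p> before the <ul>, so the <p> has no <ul> child.
child_lists = goal.xpath('./ul')

# The <ul> is a later sibling of the <p> instead.
sibling_lists = goal.xpath('following-sibling::ul')

# Tag names of all elements that follow the <p> at the same level.
following_tags = [el.tag for el in goal.itersiblings()]

print(child_lists)  # []
print([ul.text_content() for ul in sibling_lists])
print(following_tags)
```

Note that `following-sibling::` is narrower than the `following::` axis used above: it only looks at elements with the same parent as the <p>, which is usually what you want when recovering content that was "pushed out" of the paragraph.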
