Reputation: 17839
After going through the XPath in lxml tutorial for Python, I'm finding it hard to understand two behaviors that seem like bugs to me. First, lxml seems to return a list even when my XPath expression clearly selects only one element. Second, .xpath seems to return the elements' parent rather than the elements themselves for a straightforward XPath expression.
Is my understanding of XPath all wrong or does lxml indeed have a bug?
The script to replicate the behaviors I'm talking about:
from lxml.html.soupparser import fromstring
doc = fromstring("""
<html>
<head></head>
<body>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</body>
</html>
""")
print(doc.xpath("//html"))
# [<Element html at 1f385e0>]
# (This makes sense - return a list of all possible matches for html)
print(doc.xpath("//html[1]"))
# [<Element html at 1f385e0>]
# (This doesn't make sense - why do I get a list when there
# can clearly only be 1 element returned?)
print(doc.xpath("body"))
# [<Element body at 1d003e8>]
# (This doesn't make sense - according to
# http://www.w3schools.com/xpath/xpath_syntax.asp if I use a tag name
# without any leading / I should get the *child* nodes of the named
# node, which in this case would mean I get a list of
# p tags [<Element p at ...>, <Element p at ...>])
Upvotes: 0
Views: 906
Reputation: 2157
In fact, doc.xpath("//html[1]") can return more than one node given a different input document from your example. The predicate [1] is applied per parent: the path selects every html element that is the first html child of its parent, so if matching elements exist under different parents, you get the first one from each.
The XPath (//html)[1] forces a different order of evaluation. It selects all of the matching elements in the document and then chooses the first.
But, in any case, it's a better API design to always return a list. Otherwise, code would always have to test for single or None values before processing the list.
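The difference between the two forms is easier to see on an element that can appear under several parents. The following is a minimal sketch, assuming lxml.etree and a made-up document with two div parents (neither is from the question); it uses p rather than html because a well-formed document has only one html element.

```python
from lxml import etree

# Hypothetical document: two <div> parents, each with two <p> children.
root = etree.fromstring(
    "<root>"
    "<div><p>a1</p><p>a2</p></div>"
    "<div><p>b1</p><p>b2</p></div>"
    "</root>"
)

# //p[1]: every <p> that is the first <p> child of its own parent.
first_per_parent = [p.text for p in root.xpath("//p[1]")]

# (//p)[1]: gather all <p> elements in document order, then keep the first.
first_overall = [p.text for p in root.xpath("(//p)[1]")]

print(first_per_parent)  # ['a1', 'b1']
print(first_overall)     # ['a1']
```

Both calls return a list, including the one that can only ever match a single element, so calling code can loop over the result without special cases.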
Upvotes: 0
Reputation: 11383
It's because the context node of doc is the 'html' node. When you use doc.xpath('body'), it selects the child element 'body' of 'html'. This conforms to the XPath 1.0 standard.
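A short sketch of how the context node affects relative paths, assuming lxml.etree in place of the soupparser used in the question and a compacted copy of the question's document:

```python
from lxml import etree

# Compact version of the question's document, parsed with lxml.etree
# (an assumption; the question used lxml.html.soupparser).
doc = etree.fromstring(
    "<html><head/><body><p>Paragraph 1</p><p>Paragraph 2</p></body></html>"
)

# The parsed root is the <html> element, so it is the context node.
print(doc.tag)  # html

# A bare name is shorthand for child::name, evaluated against the context node.
body_matches = [el.tag for el in doc.xpath("body")]
p_via_body = [el.text for el in doc.xpath("body/p")]
direct_p = doc.xpath("p")  # <p> is not a direct child of <html>

print(body_matches)  # ['body']
print(p_via_body)    # ['Paragraph 1', 'Paragraph 2']
print(direct_p)      # []
```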
Upvotes: 3
Reputation: 5478
All p tags should be doc.findall(".//p").
As per the guide, the expression nodename "selects all child nodes of the named node".
Thus, to use only nodename (without a leading /), you must already have a node selected as the context (to start from the current node, use a dot, as in .//p).
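As a rough illustration of both approaches, assuming lxml.etree and a compacted copy of the question's document (the variable names are only for this sketch):

```python
from lxml import etree

# Compact version of the question's document (lxml.etree is an assumption;
# the question used lxml.html.soupparser).
doc = etree.fromstring(
    "<html><head/><body><p>Paragraph 1</p><p>Paragraph 2</p></body></html>"
)

# ".//p" starts at the current node (the dot) and searches all descendants.
all_p = [p.text for p in doc.findall(".//p")]

# Alternatively, select the named node first, then ask for its children.
body = doc.find("body")
body_children = [child.tag for child in body.xpath("*")]

print(all_p)          # ['Paragraph 1', 'Paragraph 2']
print(body_children)  # ['p', 'p']
```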
Upvotes: 0