Trindaz

Reputation: 17839

lxml bug in .xpath?

After going through the XPath section of the lxml tutorial for Python, I'm finding it hard to understand two behaviors that seem like bugs to me. Firstly, lxml seems to return a list even when my XPath expression clearly selects only one element. Secondly, .xpath seems to return the parent of the selected elements rather than the elements themselves, even for a straightforward XPath expression.

Is my understanding of XPath all wrong or does lxml indeed have a bug?

The script to replicate the behaviors I'm talking about:

from lxml.html.soupparser import fromstring
doc = fromstring("""
    <html>
        <head></head>
        <body>
            <p>Paragraph 1</p>
            <p>Paragraph 2</p>
        </body>
    </html>
""")

print(doc.xpath("//html"))
#[<Element html at 1f385e0>]
#(This makes sense - return a list of all possible matches for html)

print(doc.xpath("//html[1]"))
#[<Element html at 1f385e0>]
#(This doesn't make sense - why do I get a list when there
#can clearly only be 1 element returned?)

print(doc.xpath("body"))
#[<Element body at 1d003e8>]
#(This doesn't make sense - according to
#http://www.w3schools.com/xpath/xpath_syntax.asp if I use a tag name
#without any leading / I should get the *child* nodes of the named
#node, which in this case would mean I get a list of
#p tags [<Element p at ...>, <Element p at ...>])

Upvotes: 0

Views: 906

Answers (3)

Steven D. Majewski

Reputation: 2157

In fact, doc.xpath("//html[1]") can return more than one node for an input document different from your example. The [1] predicate applies to the last step of the path, so the expression selects every html element that is the first html child of its parent. If matching elements sit under different parents, it selects the first one under each parent.
The XPath (//html)[1] forces a different order of evaluation: it selects all of the matching elements in the document and then takes the first.

In any case, always returning a list is the better API design. Otherwise, calling code would have to check for a single element or None before it could process the result as a list.
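The difference between the two expressions can be sketched with a small document. This is only an illustration: the <root>/<div>/<p> markup is made up for the example, and lxml.etree.fromstring is used instead of the soupparser from the question so that the snippet needs nothing beyond lxml.

```python
from lxml import etree

# Hypothetical document: two <div> parents, each with its own <p> children.
doc = etree.fromstring(
    "<root>"
    "<div><p>a1</p><p>a2</p></div>"
    "<div><p>b1</p><p>b2</p></div>"
    "</root>"
)

# //p[1] keeps every <p> that is the first <p> child of its parent.
first_in_each_parent = [p.text for p in doc.xpath("//p[1]")]

# (//p)[1] gathers all <p> elements in document order, then keeps the first.
first_overall = [p.text for p in doc.xpath("(//p)[1]")]

print(first_in_each_parent)  # ['a1', 'b1']
print(first_overall)         # ['a1']
```

Both calls return a list, even when only one element matches, so the same loop or len() check works for every query.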

Upvotes: 0

Kien Truong

Reputation: 11383

This happens because the context node of doc is the 'html' node. When you call doc.xpath('body'), it selects the child element 'body' of 'html'. This conforms to the XPath 1.0 standard.
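A minimal sketch of the context-node behavior, assuming the same markup as in the question but parsed with lxml.etree.fromstring (a substitution made here to keep the example free of the BeautifulSoup dependency):

```python
from lxml import etree

doc = etree.fromstring(
    "<html><head/><body><p>Paragraph 1</p><p>Paragraph 2</p></body></html>"
)

# The parsed object is the root element, so it is the context node.
root_tag = doc.tag

# A bare name is shorthand for child::name, relative to the context node.
body_tags = [el.tag for el in doc.xpath("body")]

# <html> has no <html> child, so the relative path "html" matches nothing.
html_children = doc.xpath("html")

# To reach the paragraphs, add one more step to the relative path.
paragraph_texts = [p.text for p in doc.xpath("body/p")]

print(root_tag)         # html
print(body_tags)        # ['body']
print(html_children)    # []
print(paragraph_texts)  # ['Paragraph 1', 'Paragraph 2']
```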

Upvotes: 3

Pratyush

Reputation: 5478

To get all p tags, use doc.findall(".//p").

According to the guide, the expression nodename selects all child nodes of the named node. So to use a bare nodename (without a leading /), you must already have a node selected as the starting point. To refer to that selected node itself, use a dot.
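As a rough illustration of the dot-prefixed path, assuming the markup from the question parsed with lxml.etree.fromstring (a substitution for the soupparser used there):

```python
from lxml import etree

doc = etree.fromstring(
    "<html><head/><body><p>Paragraph 1</p><p>Paragraph 2</p></body></html>"
)

# ".//p" starts at the selected node (the dot) and searches all descendants.
found_texts = [p.text for p in doc.findall(".//p")]

# A bare "p" only looks at direct children of <html>, and there are none.
direct_children = doc.findall("p")

# The same dot-prefixed path also works with .xpath().
xpath_texts = [p.text for p in doc.xpath(".//p")]

print(found_texts)      # ['Paragraph 1', 'Paragraph 2']
print(direct_children)  # []
print(xpath_texts)      # ['Paragraph 1', 'Paragraph 2']
```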

Upvotes: 0
