Akarsh Jain

Reputation: 1000

How to get absolute XPaths that include text nodes using lxml's getpath function?

In R, I can get the desired result:

library(xml2)
root = read_html("<div><p>abc<br> xyz</p></div>")
elements = xml_find_all(root, "//.")
xml_path(elements)
#> [1] "/"                          "/html"                     
#> [3] "/html/body"                 "/html/body/div"            
#> [5] "/html/body/div/p"           "/html/body/div/p/text()[1]"
#> [7] "/html/body/div/p/br"        "/html/body/div/p/text()[2]"

The text-node paths (/html/body/div/p/text()[1], /html/body/div/p/text()[2]) are the ones I want.

In Python, when I use lxml's getpath I get an error, because the XPath query returns bare pieces of text along with the element nodes.

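from lxml import etree, html

root = html.fromstring("<div><p>abc<br> xyz</p></div>")
elements = root.xpath("//.")
# getpath() fails here because some of the results are strings, not elements
xpath_elements = [etree.ElementTree(root).getpath(x) for x in elements]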

But when I use an XPath that selects only element nodes, I don't get the same results as with R's xml2:

root = html.fromstring("<div><p>abc<br> xyz</p></div>")
elements = root.xpath("//*")
xpath_elements = [etree.ElementTree(root).getpath(x) for x in elements]
print(xpath_elements)  

#> ['/html', '/html/body', '/html/body/div', '/div/p', '/div/p/br']

How can I produce the same XPath results that R's xml2 library produces?

Upvotes: 1

Views: 278

Answers (1)

Andersson

Reputation: 52665

In lxml, root.xpath(XPATH) returns text nodes as strings, not as Element objects, so they cannot be passed to getpath.
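To illustrate the point, here is a small sketch of what a text query returns. The strings lxml hands back are "smart strings" that remember where they came from through getparent(), is_text and is_tail; the variable names below are only for illustration.

```python
from lxml import etree, html

root = html.fromstring("<div><p>abc<br> xyz</p></div>")
nodes = root.xpath("//text()")

for node in nodes:
    # Each result is a str subclass, not an Element
    kind = "text" if node.is_text else "tail"
    # For tail text, getparent() is the element the text follows
    print(repr(str(node)), kind, node.getparent().tag)
```

Note that ' xyz' is reported as the tail of br, not as a child of p, which is why the text index has to be computed separately.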

You can try the workaround below (it still won't behave exactly as in R):

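from lxml import etree, html

root = html.fromstring("<div><p>abc<br> xyz</p></div>")
tree = etree.ElementTree(root)

# Elements that have at least one text-node child
elements = root.xpath("//*[text()]")
xpath_elements = []
for element in elements:
    # itertext() also yields text from descendants, so the index is only
    # reliable when child elements carry no text of their own
    texts = list(element.itertext())
    for text_node in texts:
        if text_node.strip():
            xpath_elements.append("%s/text()[%d]" % (tree.getpath(element), texts.index(text_node) + 1))

print(xpath_elements)  # ['/div/p/text()[1]', '/div/p/text()[2]']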

P.S. Because list.index() returns the index of the first occurrence, this will not work for an element whose text nodes are identical, e.g. <p>QWERTY<br>QWERTY</p>. This is a rare case, but let me know if you need to handle it as well.
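For the identical-text case, one possible sketch is to count text nodes by position instead of looking them up by value. The helper name text_node_paths and its skip_blank parameter are made up for this example. It assumes that an element's text-node children are its own .text followed by the .tail of each child, and it uses root.getroottree() so the paths start at /html.

```python
from lxml import etree, html


def text_node_paths(root, skip_blank=True):
    """Return absolute XPaths for the text nodes in root's document (hypothetical helper)."""
    tree = root.getroottree()
    paths = []
    for element in tree.iter():
        if not isinstance(element.tag, str):
            continue  # comments and processing instructions have no text-node children
        base = tree.getpath(element)
        # Text-node children in order: the element's own text, then each child's tail
        chunks = [element.text] + [child.tail for child in element]
        index = 0
        for chunk in chunks:
            if chunk is None:
                continue  # no text node at this position
            index += 1  # position counts every text node, blank or not
            if skip_blank and not chunk.strip():
                continue
            paths.append("%s/text()[%d]" % (base, index))
    return paths


root = html.fromstring("<div><p>abc<br> xyz</p></div>")
print(text_node_paths(root))
# ['/html/body/div/p/text()[1]', '/html/body/div/p/text()[2]']
```

The paths come out grouped by element rather than interleaved with the element paths as in the R output, so merge them with the getpath results if you need strict document order.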

Upvotes: 1
