Need the in-depth absolute XPath of text nodes too, using lxml's getpath function

Question

In R I can get the desired results.

library(xml2)
root = read_html("<div><p>abc<br>xyz</p></div>")
elements = xml_find_all(root, "//.")
xml_path(elements)
#> [1] "/"                          "/html"                     
#> [3] "/html/body"                 "/html/body/div"            
#> [5] "/html/body/div/p"           "/html/body/div/p/text()[1]"
#> [7] "/html/body/div/p/br"        "/html/body/div/p/text()[2]"

The text nodes (/html/body/div/p/text()[1] and /html/body/div/p/text()[2]) are the ones I want.

In Python, when I use lxml's getpath, I get an error because the query returns bare text strings along with the element nodes.

from lxml import etree, html

root = html.fromstring("<div><p>abc<br>xyz</p></div>")
elements = root.xpath("//.")
xpath_elements = [etree.ElementTree(root).getpath(x) for x in elements]

But when I use an XPath that selects only element nodes, I don't get the same results as with R's xml2:

root = html.fromstring("<div><p>abc<br>xyz</p></div>")
elements = root.xpath("//*")
xpath_elements = [etree.ElementTree(root).getpath(x) for x in elements]
print(xpath_elements)  

#> ['/html', '/html/body', '/html/body/div', '/div/p', '/div/p/br']

How can I produce the same XPath results that R's xml2 library produces?

Andersson · Accepted Answer

In lxml, root.xpath(XPATH) returns text nodes as strings, not as Element objects, so getpath cannot be called on them.
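To make that concrete, here is a small sketch of what the text results look like. It assumes the sample markup is `<div><p>abc<br>xyz</p></div>` and uses `html.document_fromstring` so that the root is the `html` element. The strings lxml returns are "smart" strings: they are not elements, but they expose `getparent()` plus the `is_text` and `is_tail` flags, which tell you which element carries the text.

```python
from lxml import html

# Assumed sample markup: two text nodes separated by a <br>
root = html.document_fromstring("<div><p>abc<br>xyz</p></div>")

described = []
for node in root.xpath("//text()"):
    # Each result is a str subclass, not an Element
    kind = "tail" if node.is_tail else "text"
    described.append((str(node), kind, node.getparent().tag))

print(described)
# [('abc', 'text', 'p'), ('xyz', 'tail', 'br')]
```

Note that "xyz" is reported as the tail of `br`, even though in XPath terms it is the second text-node child of `p`.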

You can try the workaround below (it still won't behave exactly as in R):

tree = etree.ElementTree(root)
elements = root.xpath("//*[text()]")  # elements with at least one text-node child
xpath_elements = []
for element in elements:
    texts = list(element.itertext())
    for text_node in texts:
        if text_node.strip():
            # list.index() returns the first match, hence the caveat below
            position = texts.index(text_node) + 1
            xpath_elements.append("%s/text()[%d]" % (tree.getpath(element), position))

print(xpath_elements)  # ['/div/p/text()[1]', '/div/p/text()[2]']

P.S. Since list.index(element) returns the index of the first occurrence, this will not work for a node whose text nodes are exactly the same, e.g. a node containing QWERTY twice. This is an extremely rare case, but let me know if you need to handle such cases as well.
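One possible way around the duplicate-text caveat is to count positions instead of searching by value. The sketch below is not part of the original answer: `text_node_paths` is a hypothetical helper, and it assumes that the text-node children of an element are its `.text` followed by the `.tail` of each child, in document order. It also uses `getroottree()` so that paths are resolved against the whole document rather than a subtree.

```python
from lxml import html


def text_node_paths(root):
    """Return an absolute XPath for every non-blank text node under root.

    text_node_paths is a hypothetical helper, not part of lxml.
    """
    tree = root.getroottree()  # resolve paths against the whole document
    paths = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        base = tree.getpath(element)
        # Text-node children in document order: the element's own text,
        # then the tail that follows each child node.
        chunks = [element.text] + [child.tail for child in element]
        position = 0
        for chunk in chunks:
            if not chunk:
                continue  # no text node in this slot
            position += 1  # XPath positions are 1-based
            if chunk.strip():
                paths.append("%s/text()[%d]" % (base, position))
    return paths


# Assumed markup for the duplicate-text case
doc = html.document_fromstring("<div><p>QWERTY<br>QWERTY</p></div>")
print(text_node_paths(doc))
# ['/html/body/div/p/text()[1]', '/html/body/div/p/text()[2]']
```

Because the position comes from a running counter, identical text nodes get distinct indices, and text belonging to nested elements is attributed to its own parent rather than to an ancestor.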
