Reputation: 1000
In R I can get the desired results.
library(xml2)
root = read_html("<div><p>abc<br> xyz</p></div>")
elements = xml_find_all(root, "//.")
xml_path(elements)
#> [1] "/" "/html"
#> [3] "/html/body" "/html/body/div"
#> [5] "/html/body/div/p" "/html/body/div/p/text()[1]"
#> [7] "/html/body/div/p/br" "/html/body/div/p/text()[2]"
The text nodes (/html/body/div/p/text()[1], /html/body/div/p/text()[2]) are the ones I want.
In Python, when I use lxml's getpath I get an error, because the XPath result contains bare text strings along with the element nodes.
from lxml import etree, html
root = html.fromstring("<div><p>abc<br> xyz</p></div>")
elements = root.xpath("//.")
xpath_elements = [etree.ElementTree(root).getpath(x) for x in elements]
But when I use an XPath that selects only element nodes, I don't get the same results as with R's xml2:
root = html.fromstring("<div><p>abc<br> xyz</p></div>")
elements = root.xpath("//*")
xpath_elements = [etree.ElementTree(root).getpath(x) for x in elements]
print(xpath_elements)
#> ['/html', '/html/body', '/html/body/div', '/div/p', '/div/p/br']
How can I produce the same XPath results that R's xml2 library produces?
Upvotes: 1
Views: 278
Reputation: 52665
In lxml, root.xpath(XPATH) returns text nodes as strings, not as Element objects.
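To make that concrete, here is a small sketch of what the text results look like. It relies on lxml's default "smart strings", which are str subclasses carrying getparent(), is_text and is_tail; the variable name info is just for illustration.

```python
from lxml import html

root = html.fromstring("<div><p>abc<br> xyz</p></div>")

# Each text result is a str subclass that remembers the element whose
# .text or .tail it came from (lxml "smart strings", on by default).
info = []
for node in root.xpath("//text()"):
    assert isinstance(node, str)
    info.append((str(node), node.getparent().tag, node.is_text, node.is_tail))

for row in info:
    print(row)
# ('abc', 'p', True, False)   -> "abc" is the .text of <p>
# (' xyz', 'br', False, True) -> " xyz" is the .tail of <br>
```

Note that for a tail string getparent() gives the element the text follows (here br), not the element that contains it in XPath terms (p).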
You can try the workaround below (it still won't behave exactly as in R):
elements = root.xpath("//*[text()]")
xpath_elements = []
for element in elements:
    texts = list(element.itertext())
    for text_node in texts:
        if text_node.strip():
            path = etree.ElementTree(root).getpath(element)
            xpath_elements.append(path + "/text()[%d]" % (texts.index(text_node) + 1))
print(xpath_elements)  # ['/div/p/text()[1]', '/div/p/text()[2]']
P.S. As list.index(element) returns the index of the first occurrence of an element, this will not work for a node whose text nodes are exactly identical, e.g. <p>QWERTY<br>QWERTY</p>. This is an extremely rare case, but let me know if you need to handle such cases as well.
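If the duplicate-text case does matter, one possible alternative is to avoid list.index() and itertext() altogether (itertext() also walks into descendant elements, which can shift the indices) and instead derive the position from where each text string sits in the tree. The following is only a sketch: the helper name text_node_paths is made up for illustration, it uses getroottree() rather than etree.ElementTree(root) so that paths start at /html as in the R output, and it reports whitespace-only text nodes too.

```python
from lxml import html


def text_node_paths(root):
    """Return an XPath for every text node below root (hypothetical helper)."""
    tree = root.getroottree()
    paths = []
    for node in root.xpath(".//text()"):
        owner = node.getparent()  # element whose .text or .tail holds this string
        if node.is_text:
            # .text is always the first text child of its element
            parent, position = owner, 1
        else:
            # A tail belongs, in XPath terms, to the owner's parent element.
            parent = owner.getparent()
            position = 1 if parent.text else 0
            # Count the tails of the children up to and including the owner.
            for sibling in parent:
                if sibling.tail:
                    position += 1
                if sibling is owner:
                    break
        paths.append("%s/text()[%d]" % (tree.getpath(parent), position))
    return paths


root = html.fromstring("<div><p>abc<br> xyz</p></div>")
print(text_node_paths(root))
# ['/html/body/div/p/text()[1]', '/html/body/div/p/text()[2]']

duplicate = html.fromstring("<p>QWERTY<br>QWERTY</p>")
print(text_node_paths(duplicate))
# ['/html/body/p/text()[1]', '/html/body/p/text()[2]']
```

Because the position comes from the tree structure rather than from the string contents, identical text nodes get distinct indices.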
Upvotes: 1