Applying root.xpath() with regex returns an lxml.etree._ElementUnicodeResult

Question

I'm generating a model to find out where a piece of text is located in an HTML file.

I have a database with plenty of data from articles in different newspapers: title, publish date, authors, and news text. By analyzing this data, I want to generate a model that can find by itself the XPath to the HTML tags holding this content.

The problem arises when I use a regex inside the xpath() method, as shown here:

from lxml import html

with open('somecode.html', 'r') as f:
    root = html.fromstring(f.read())

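# The "re:" prefix has to be bound to the EXSLT regular-expressions namespace
list_of_xpaths = root.xpath(
    '//*/@*[re:match(.,"2019-04-15")]',
    namespaces={'re': 'http://exslt.org/regular-expressions'},
)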

This example searches the code for the publish date. It returns a list of lxml.etree._ElementUnicodeResult objects instead of lxml.etree._Element objects.

Unfortunately, this type of object doesn't let me get the XPath to where it is located, the way an lxml.etree._Element does when I call root.getroottree().getpath(list_of_xpaths[0]).

Is there a way to get the XPath for this type of element? How?

Is there a way to make lxml with a regex return an lxml.etree._Element instead?

alecxe · Accepted Answer

The problem is that you get an attribute value, represented as an instance of the _ElementUnicodeResult class.

If we introspect what the _ElementUnicodeResult class provides, we can see that it lets you get to the element that carries this attribute via the .getparent() method:

attribute = list_of_xpaths[0]
element = attribute.getparent()

print(root.getroottree().getpath(element))

This gives us the path to the element. Since we need the attribute name as well, we can read it from .attrname:

print(attribute.attrname)

Then, to get the complete XPath pointing at the element's attribute, we can combine the two:

path_to_element = root.getroottree().getpath(element)
attribute_name = attribute.attrname

complete_path = path_to_element + "/@" + attribute_name
print(complete_path)
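Putting the steps above together, here is a minimal self-contained sketch. The inline SAMPLE markup is a hypothetical stand-in for 'somecode.html', and the names SAMPLE, EXSLT_RE, and attribute_paths are made up for illustration. It uses re:test, the boolean EXSLT regex function, and binds the re prefix explicitly through the namespaces argument.

```python
from lxml import html

# Hypothetical sample markup standing in for 'somecode.html'.
SAMPLE = """
<html><body>
  <article>
    <time datetime="2019-04-15T08:00">April 15</time>
    <span data-published="2019-04-15">Published</span>
  </article>
</body></html>
"""

# Namespace binding for the EXSLT regular-expression functions.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

root = html.fromstring(SAMPLE)
tree = root.getroottree()

# Each result is a "smart string" that remembers where it came from.
attributes = root.xpath('//*/@*[re:test(., "2019-04-15")]', namespaces=EXSLT_RE)

# Element path plus "/@name" gives a path to the attribute itself.
attribute_paths = [
    f"{tree.getpath(attr.getparent())}/@{attr.attrname}"
    for attr in attributes
]

for path in attribute_paths:
    # Evaluating the generated path should lead back to the matched value.
    print(path, "->", tree.xpath(path))
```

Evaluating each generated path again, as the final loop does, is a cheap sanity check that the path really points at the attribute that matched.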

FYI, _ElementUnicodeResult also tells you whether the result is actually an attribute via its .is_attribute property. The same class represents text nodes and tails too, which are flagged by .is_text and .is_tail.
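To illustrate how these flags differ, here is a small sketch on a made-up snippet. The origin() helper is hypothetical and not part of lxml. Note that for a tail, .getparent() returns the element the tail text follows, not the enclosing element.

```python
from lxml import etree

root = etree.fromstring('<p class="lead">Hello<br/>World</p>')

def origin(result):
    """Describe where a smart string came from (hypothetical helper)."""
    owner = result.getparent().tag
    if result.is_attribute:
        return f"attribute {result.attrname!r} of <{owner}>"
    if result.is_tail:
        return f"tail of <{owner}>"
    return f"text of <{owner}>"

attr = root.xpath('//@class')[0]
text, tail = root.xpath('//text()')

for result in (attr, text, tail):
    print(repr(str(result)), "is the", origin(result))
```

A model that needs paths to text content as well as attributes could branch on these flags in the same way before building the path.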
