Reputation: 1504
I'm generating a model to find out where a piece of text is located in an HTML file.
So, I have a database with plenty of data from different newspaper articles, with fields like title, publish date, authors and news text. By analyzing this data, I'm trying to generate a model that can find by itself the XPath to the HTML tags holding this content.
The problem appears when I use a regex within the xpath method, as shown here:
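from lxml import html
# The re: prefix must be bound to the EXSLT regular-expressions namespace
ns = {'re': 'http://exslt.org/regular-expressions'}
with open('somecode.html', 'r') as f:
    root = html.fromstring(f.read())
list_of_xpaths = root.xpath('//*/@*[re:match(.,"2019-04-15")]', namespaces=ns)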
This is an example of searching for the publish date in the code. Each result is an lxml.etree._ElementUnicodeResult instead of an lxml.etree._Element.
Unfortunately, unlike an lxml.etree._Element, this type of result doesn't let me get the XPath to where it is located by calling root.getroottree().getpath(list_of_xpaths[0]).
Is there a way to get the XPath for this type of element? How?
Is there a way to make lxml with a regex return an lxml.etree._Element instead?
Upvotes: 1
Views: 251
Reputation: 473893
The problem is that you get an attribute value, represented as an instance of the _ElementUnicodeResult class.
If we introspect what the _ElementUnicodeResult class provides, we can see that it lets you get to the element that has this attribute via the .getparent() method:
attribute = list_of_xpaths[0]
element = attribute.getparent()
print(root.getroottree().getpath(element))
This gives us the path to the element, but since we need the attribute name as well, we can do:
print(attribute.attrname)
Then, to get the complete xpath pointing at the element attribute, we may use:
path_to_element = root.getroottree().getpath(element)
attribute_name = attribute.attrname
complete_path = path_to_element + "/@" + attribute_name
print(complete_path)
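Putting the steps above together, here is a minimal self-contained sketch. The sample markup, the `REGEX_NS` constant and the `attribute_xpaths` helper are hypothetical names made up for illustration, and it uses `re:test` (the boolean EXSLT function) rather than the `re:match` from the question; the search pattern is passed in as an XPath variable.

```python
from lxml import html

# EXSLT regular-expressions namespace, needed for the re: prefix in lxml XPath.
REGEX_NS = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample document standing in for somecode.html.
SAMPLE = """
<html><body>
  <div><time datetime="2019-04-15T10:00:00">15 April</time></div>
  <p data-published="2019-04-15">Some text</p>
</body></html>
"""


def attribute_xpaths(root, pattern):
    """Return the full XPath of every attribute whose value matches pattern."""
    tree = root.getroottree()
    results = root.xpath(
        "//*/@*[re:test(., $pattern)]",
        namespaces=REGEX_NS,
        pattern=pattern,  # bound to the $pattern XPath variable
    )
    paths = []
    for attribute in results:
        # Skip anything that is not an attribute value (defensive check).
        if not attribute.is_attribute:
            continue
        element = attribute.getparent()
        paths.append(tree.getpath(element) + "/@" + attribute.attrname)
    return paths


root = html.fromstring(SAMPLE)
paths = attribute_xpaths(root, "2019-04-15")
for path in paths:
    print(path)
```

Each returned path can be fed back into `root.xpath(path)` to retrieve the matching attribute value again, which is a quick way to check that the generated XPath points where you expect.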
FYI, _ElementUnicodeResult also indicates whether the result is actually an attribute via the .is_attribute property (this class also represents text nodes and tails).
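As a rough illustration of that last point, the sketch below classifies results using `.is_attribute` together with the companion `.is_text` and `.is_tail` properties. The sample markup and the `describe` helper are hypothetical, and `re:test` is used here as the boolean regex check.

```python
from lxml import html

# EXSLT regular-expressions namespace, needed for the re: prefix in lxml XPath.
REGEX_NS = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical markup with the date in an attribute, in element text and in a tail.
SAMPLE = (
    '<html><body><p class="d-2019-04-15">'
    "Published 2019-04-15<br>Updated 2019-04-15"
    "</p></body></html>"
)


def describe(result):
    """Return (kind, tag of the element that getparent() reports)."""
    if result.is_attribute:
        kind = "attribute"
    elif result.is_text:
        kind = "text"
    elif result.is_tail:
        kind = "tail"
    else:
        kind = "other"
    return kind, result.getparent().tag


root = html.fromstring(SAMPLE)
attributes = root.xpath('//@*[re:test(., "2019-04-15")]', namespaces=REGEX_NS)
texts = root.xpath('//text()[re:test(., "2019-04-15")]', namespaces=REGEX_NS)
kinds = [describe(r) for r in attributes + texts]
print(kinds)
```

Note that for a tail, `.getparent()` returns the element the tail text follows (here the `br`), not the element that visually contains it, so a path built from it needs different handling than the attribute case.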
Upvotes: 1