ngoue

Reputation: 1055

xpath regex doesn't search tail in lxml.etree

I'm working with lxml.etree and I'm trying to let users search a DocBook document for text. When a user provides the search text, I use the EXSLT re:match function to find it within the document. The match works fine if the text is in an element's .text, but not if it is in an element's .tail.

Here's an example:

>>> import lxml.etree
>>>
>>> # XML as lxml.etree element
>>> root = lxml.etree.fromstring('''
...   <root>
...     <foo>Sample text
...       <bar>and more sample text</bar> and important text.
...     </foo>
...   </root>
... ''')
>>>
>>> # User provides search text    
>>> search_term = 'important'
>>>
>>> # Find nodes with matching text
>>> matches = root.xpath('//*[re:match(text(), $search, "i")]', search=search_term, namespaces={'re':'http://exslt.org/regular-expressions'})
>>> print(matches)
[]
>>>
>>> # But I know it's there...
>>> bar = root.xpath('//bar')[0]
>>> print(bar.tail)
 and important text.

I'm confused because the text() function by itself returns all the text – including the tail:

>>> # text() results
>>> text = root.xpath('//foo/text()')
>>> print(text)
['Sample text\n      ', ' and important text.\n    ']

How come the tail isn't being included when I use the match function?

Upvotes: 1

Views: 644

Answers (1)

har07

Reputation: 89285

How come the tail isn't being included when I use the match function?

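That's because in XPath 1.0, when a function that expects a string (contains(), starts-with(), and likewise re:match() here) is given a node-set, only the first node in document order is taken into account. In your example, text() on foo selects two text nodes, 'Sample text' and the tail of bar, but only 'Sample text' is ever tested against the pattern.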
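
To make the first-node rule visible, here is a minimal sketch that reuses the question's document, collapsed onto one line so the whitespace stays out of the way. The variable names (ns, first_only) are only for illustration.

```python
import lxml.etree

root = lxml.etree.fromstring(
    "<root><foo>Sample text<bar>and more sample text</bar>"
    " and important text.</foo></root>"
)
ns = {"re": "http://exslt.org/regular-expressions"}

# foo has two text-node children: its own text and the tail of <bar>.
print(root.xpath("//foo/text()"))  # ['Sample text', ' and important text.']

# Converting that node-set to a string keeps only the first node.
first_only = root.xpath("string(//foo/text())")
print(first_only)  # Sample text

# contains() goes through the same conversion, so the tail is never examined.
print(root.xpath("contains(//foo/text(), 'important')"))  # False
print(root.xpath("contains(//foo/text(), 'Sample')"))     # True

# The query from the question fails for the same reason.
print(root.xpath('//*[re:match(text(), "important", "i")]', namespaces=ns))  # []
```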

Instead, you can select every text node with //text(), apply the regex filter to each text node individually, and then return the matching text node's parent element, like so:

xpath = '//text()[re:match(., $search, "i")]/parent::*'
matches = root.xpath(xpath, search=search_term, namespaces={'re':'http://exslt.org/regular-expressions'})
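
For completeness, a self-contained sketch of that query run against the document from the question. The helper name find_elements and the RE_NS constant are hypothetical, introduced only for this example. Note that in the XPath data model a tail is a text-node child of the enclosing element, so the match on bar's tail returns foo, not bar.

```python
import lxml.etree

RE_NS = {"re": "http://exslt.org/regular-expressions"}

root = lxml.etree.fromstring("""
<root>
  <foo>Sample text
    <bar>and more sample text</bar> and important text.
  </foo>
</root>
""")


def find_elements(tree, search_term):
    """Return the parent elements of all text nodes matching search_term."""
    # Each text node is tested on its own, so tails are covered too.
    xpath = '//text()[re:match(., $search, "i")]/parent::*'
    return tree.xpath(xpath, search=search_term, namespaces=RE_NS)


matches = find_elements(root, "important")
print([el.tag for el in matches])  # ['foo']

# "sample" occurs in foo's own text and in bar's text.
print([el.tag for el in find_elements(root, "SAMPLE")])  # ['foo', 'bar']
```

The search term is interpreted as a regular expression. As far as I know lxml evaluates EXSLT expressions with Python's re module, so passing re.escape(search_term) should work if users are expected to type literal text; treat that as an assumption and test it with your own input.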

Upvotes: 2
