lxml find tags by regex

Question

I'm trying to use lxml to get an array of tags that are formatted as

<TEXT1>TEXT</TEXT1>

<TEXT2>TEXT</TEXT2>

<TEXT3>TEXT</TEXT3>

I tried using

xml_file.findall("TEXT*")

but this searches for a literal asterisk.

I've also tried to use ETXPath, but it doesn't seem to work. Is there an API function for this? Assuming that TEXT is followed by integers isn't the prettiest solution.

Robᵩ · Accepted Answer

Yes, you can use regular expressions in lxml XPath expressions, through the EXSLT regular-expressions namespace.

Here is one example:

results = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

Of course, in the example you mention you don't really need a regular expression. You could use the starts-with() XPath function:

results = root.xpath("//*[starts-with(local-name(), 'TEXT')]")
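The two expressions agree on the question's input, but they are not equivalent in general. Here is a minimal sketch, assuming a hypothetical document that also contains a TEXTURE element (not part of the question), to show where a pattern that requires trailing digits behaves differently from starts-with():

```python
from lxml import etree

# Hypothetical input: TEXTURE starts with "TEXT" but has no numeric suffix.
root = etree.XML(
    "<root><TEXT1>one</TEXT1><TEXT2>two</TEXT2><TEXTURE>wood</TEXTURE></root>"
)

RE_NS = {"re": "http://exslt.org/regular-expressions"}

# starts-with() matches every element whose name begins with "TEXT".
prefix_tags = [
    e.tag for e in root.xpath("//*[starts-with(local-name(), 'TEXT')]")
]

# The anchored pattern only accepts "TEXT" followed by one or more digits.
numbered_tags = [
    e.tag
    for e in root.xpath(
        "//*[re:test(local-name(), '^TEXT[0-9]+$')]", namespaces=RE_NS
    )
]

print(prefix_tags)
print(numbered_tags)
```

If the element names in your data can share the TEXT prefix with unrelated names, the anchored pattern is probably the safer choice.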

Complete program:

from lxml import etree

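# The last element's name does not start with TEXT, so neither query matches it.
root = etree.XML('''
    <root>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
      <x-TEXT4>but never four</x-TEXT4>
    </root>''')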

result1 = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

result2 = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

assert result1 == result2

for result in result1:
    print(result.text, result.tag)
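If you would rather keep the pattern on the Python side, a different technique from the XPath approach above is to walk the tree and filter with the standard re module. This is only a sketch; the sample input and the names pattern and matches are made up for illustration:

```python
import re
from lxml import etree

# Hypothetical sample input for illustration.
root = etree.XML(
    "<root><TEXT1>one</TEXT1><TEXT2>two</TEXT2><other>skip</other></root>"
)

# "TEXT" followed by one or more digits, matched against the whole local name.
pattern = re.compile(r"TEXT\d+")

matches = [
    el
    for el in root.iter()
    # Comments and processing instructions have a non-string tag; skip them.
    if isinstance(el.tag, str)
    and pattern.fullmatch(etree.QName(el).localname)
]

for el in matches:
    print(el.text, el.tag)
```

This visits every element in Python, so on large documents the XPath version is likely to be faster, but it gives you the full Python regex syntax.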

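Addressing a new requirement, consider this XML:

<root>
   <tag>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
   </tag>
   <other_tag>
      <TEXT1>do not want to find one</TEXT1>
      <TEXT2>do not want to find two</TEXT2>
      <TEXT3>do not want to find three</TEXT3>
   </other_tag>
</root>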

If one wants to find all TEXT elements that are immediate children of a tag element:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or, if one wants to find all TEXT elements that are immediate children of only the first tag element:

result = root.xpath("//tag[1]/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or, if one wants to find only the first TEXT element of each tag element:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')][1]")
assert(' '.join(e.text for e in result) == 'one')
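With a single tag element, the first two queries return the same thing, so the difference between them is easy to miss. Here is a runnable sketch, assuming a hypothetical input with two sibling tag elements (the texts helper is also made up for illustration), that makes the three queries give three different results:

```python
from lxml import etree

# Hypothetical input: two <tag> siblings plus one element we want to skip.
root = etree.XML("""<root>
  <tag><TEXT1>one</TEXT1><TEXT2>two</TEXT2></tag>
  <tag><TEXT1>three</TEXT1><TEXT2>four</TEXT2></tag>
  <other_tag><TEXT1>skipped</TEXT1></other_tag>
</root>""")


def texts(expr):
    """Return the text of every element matched by the XPath expression."""
    return [e.text for e in root.xpath(expr)]


# Every TEXT* child of every <tag>.
all_children = texts("//tag/*[starts-with(local-name(), 'TEXT')]")

# TEXT* children of the first <tag> only.
first_tag_only = texts("//tag[1]/*[starts-with(local-name(), 'TEXT')]")

# The first TEXT* child of each <tag>.
first_of_each = texts("//tag/*[starts-with(local-name(), 'TEXT')][1]")

print(all_children)    # ['one', 'two', 'three', 'four']
print(first_tag_only)  # ['one', 'two']
print(first_of_each)   # ['one', 'three']
```

The position of the [1] predicate decides what it counts: after tag it picks the first tag among its siblings, and after the starts-with() filter it picks the first matching child within each tag.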
