WhiteDragon
WhiteDragon

Reputation: 479

lxml find tags by regex

I'm trying to use lxml to get an array of tags that are formatted as

<TEXT1>TEXT</TEXT1>

<TEXT2>TEXT</TEXT2>

<TEXT3>TEXT</TEXT3>

I tried using

xml_file.findall("TEXT*")

but this searches for a literal asterisk.

I've also try to use ETXPath but it seems to not work. Is there any API function to work with that, because assuming that TEXT is append by integers isn't the prettiest solution.

Upvotes: 4

Views: 6799

Answers (2)

Robᵩ
Robᵩ

Reputation: 168586

Yes, you can use regular expressions in lxml xpath.

Here is one example:

results = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

Of course, in the example you mention you don't really need a regular expression. You could use the starts-with() xpath function:

results = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

Complete program:

from lxml import etree

root = etree.XML('''
    <root>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
      <x-TEXT4>but never four</x-TEXT4>
    </root>''')

result1 = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

result2 = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

assert(result1 == result2)

for result in result1:
    print result.text, result.tag

Addressing a new requirement, consider this XML:

<root>
   <tag>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
   </tag>
   <other_tag>
      <TEXT1>do not want to found one</TEXT1>
      <TEXT2>do not want to found two</TEXT2>
      <TEXT3>do not want to found three</TEXT3>
   </other_tag>
</root>

If one wants to find all TEXT elements that are immediate children of a <tag> element:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or, if one wants to all TEXT elements that are immediate children of only the first tag element:

result = root.xpath("//tag[1]/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or, if one wants to find only the first TEXT element of each tag element:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')][1]")
assert(' '.join(e.text for e in result) == 'one')

Resorources:

Upvotes: 10

JKesMc9tqIQe9M
JKesMc9tqIQe9M

Reputation: 103

Here's one idea:

import lxml.etree

doc = lxml.etree.parse('test.xml')
elements = [x for x in doc.xpath('//*') if x.tag.startswith('TEXT')]

Upvotes: 2

Related Questions