mrCarnivore

Reputation: 5068

Finding two matches with the same string in regex

Is there a way in regex to match a string only if it occurs twice in a given structure (e.g. the opening and closing tags in XML)? This code obviously does not work, as it matches the first opening tag and then the last closing tag:

re.findall(r'<(.+)>([\s\S]*)</(.+)>', s)

So is there a way to tell regex that the third match should be the same as the first?

Full code:

import re

s = '''<a1>
    <a2>
        1
    </a2>
    <b2>
        52
    </b2>
    <c2>
        <a3>
            Abc
        </a3>
    </c2>
</a1>
<b1>
    21
</b1>'''

matches = re.findall(r'<(.+)>([\s\S]*)</(.+)>', s)
for match in matches:
    print(match)

The result should be all the tags with their contents:

    [('a1', '\n    <a2>\n        1\n    </a2>\n    <b2>\n        52\n    </b2>\n    <c2>\n        <a3>\n            Abc\n        </a3>\n    </c2>\n'),
     ('a2', '\n        1\n    '),
      ...]

Note: I am not looking for a complete XML parsing package. The question is specifically about solving the given problem with regex.

Upvotes: 1

Views: 97

Answers (3)

mrCarnivore

Reputation: 5068

Using the help danihp gave in his answer and following the hint DDeMartini gave in a comment, I was able to create a lightweight XML parser that returns the XML as a nested dict:

import re

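def xml_loads(xml_text):
    # \1 requires the closing tag to repeat the name captured by group 1
    matches = re.findall(r'<([^<>]+)>([\s\S]*)</(\1)>', xml_text)
    if not matches:
        # no tags left: this is a leaf, so return its text
        return xml_text.strip()
    d = {}
    for k, s2, _ in matches:
        # recurse into the element body
        d[k] = xml_loads(s2)
    return d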


s = '''<a1>
    <a2>
        1
    </a2>
    <b2>
        52
    </b2>
    <c2>
        <a3>
            Abc
        </a3>
    </c2>
</a1>
<b1>
    21
</b1>'''

d = xml_loads(s)
print(d)
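
One caveat with the pattern above: because [\s\S]* is greedy, two sibling elements with the same tag name are merged into a single match. The following is a minimal sketch, not part of the original answer, of a variant with a lazy body ([\s\S]*?) that pairs each opening tag with the nearest matching closing tag. The names TAG_RE and find_elements are made up for illustration. The lazy variant still breaks when a tag is nested inside a tag of the same name.

```python
import re

# Same backreference idea, but with a lazy body so each opening tag
# pairs with the nearest closing tag of the same name.
# TAG_RE and find_elements are hypothetical names for this sketch.
TAG_RE = re.compile(r'<([^<>]+)>([\s\S]*?)</\1>')

def find_elements(xml_text):
    """Return (tag, body) pairs for the non-overlapping matches in xml_text."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(xml_text)]

siblings = '<a>1</a><a>2</a>'

# Greedy body: both siblings collapse into one match.
greedy = re.findall(r'<([^<>]+)>([\s\S]*)</\1>', siblings)
# Lazy body: each sibling is matched on its own.
lazy = find_elements(siblings)

print(greedy)  # [('a', '1</a><a>2')]
print(lazy)    # [('a', '1'), ('a', '2')]
```

Note that a dict, as returned by xml_loads, would still keep only the last value for a repeated key, so a list of pairs is used here instead.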

Upvotes: 0

Tamas Rev

Reputation: 7166

I wouldn't do this, because recursive structures are difficult to parse with regexes. Python's built-in re module doesn't support recursive patterns. The alternative regex module does. However, I still wouldn't do it.

A backreference can only get you this far:

import re

s = '''<a1>
    <a2>
        1
    </a2>
    <b2>
        52
    </b2>
    <c2>
        <a3>
            Abc
        </a3>
    </c2>
</a1>
<b1>
    21
</b1>'''

matches = re.findall(r'<(.+)>([\s\S]*)</\1>', s) # mind the \1
for match in matches:
    print(match)

It will give you two matches: <a1> ... </a1> and <b1> ... </b1>.

Now imagine that some of your tags have attributes. What if a tag spans more than one line? What about self-closing tags? What about accidental spaces?

An HTML/XML parser can deal with all of this.
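
For comparison, here is a rough sketch of how the standard library's xml.etree.ElementTree copes with those cases. The sample fragment, the synthetic root wrapper element, and the helper to_dict are assumptions made for this illustration and are not from the question.

```python
import xml.etree.ElementTree as ET

# Made-up fragment with an attribute, a self-closing tag,
# and an opening tag that spans two lines.
fragment = '''<a1 id="x">
    <a2>1</a2>
    <empty />
    <c2
        kind="nested">
        <a3>Abc</a3>
    </c2>
</a1>
<b1>21</b1>'''

# The fragment has two top-level elements, so wrap it in a synthetic
# root ("root" is an arbitrary name) to get a well-formed document.
root = ET.fromstring(f'<root>{fragment}</root>')

def to_dict(element):
    """Recursively convert an element into nested dicts or stripped text."""
    children = list(element)
    if not children:
        return (element.text or '').strip()
    return {child.tag: to_dict(child) for child in children}

print(to_dict(root))
# {'a1': {'a2': '1', 'empty': '', 'c2': {'a3': 'Abc'}}, 'b1': '21'}
print(root.find('a1').attrib)
# {'id': 'x'}
```

Like the regex-based dict approach, this to_dict helper keeps only the last child when sibling tags repeat; it is meant to show the parsing, not a complete conversion.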

Upvotes: 1

dani herrera

Reputation: 51645

You can use backreferences and simple recursion:

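>>> import re
>>> def m(s):
...    # \1 forces the closing tag to repeat the name captured by group 1
...    matches = re.findall(r'<(.+)>([\s\S]*)</(\1)>', s)
...    for k,s2,_ in matches:
...        print((k,s2))   # print the pair as a tuple
...        m(s2)           # recurse into the element body
... 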
>>> m(s)
('a1', '\n    <a2>\n  ...[dropped]...      </a3>\n    </c2>\n')
('a2', '\n        1\n    ')
('b2', '\n        52\n    ')
('c2', '\n        <a3>\n            Abc\n        </a3>\n    ')
('a3', '\n            Abc\n        ')
('b1', '\n    21\n')

More about backreferences from Microsoft Docs.
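
In Python's re module the same backreference can also be written with named groups, which may read more clearly than \1. A small sketch follows; the group names tag and body are chosen here for illustration.

```python
import re

# (?P<tag>...) names the first group; (?P=tag) must match the same text again.
pattern = re.compile(r'<(?P<tag>[^<>]+)>(?P<body>[\s\S]*)</(?P=tag)>')

text = '<b1>\n    21\n</b1>'
match = pattern.search(text)
print(match.group('tag'))         # b1
print(repr(match.group('body')))  # '\n    21\n'

# A closing tag with a different name does not satisfy the backreference.
mismatch = pattern.search('<b1>21</c1>')
print(mismatch)  # None
```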

Edited

For extra fun, here is a generator version. Thanks @mrCarnivore for the suggestion to remove the if s check:

>>> def m(s):
...    matches = re.findall(r'<(.+)>([\s\S]*)</(\1)>', s)
...    for k,s2,_ in matches:
...        yield (k,s2)
...        yield from m(s2)
... 
>>> for x in m(s):
...    x
... 
('a1', '\n    <a2>\ [....]     Abc\n        </a3>\n    </c2>\n')
('a2', '\n        1\n    ')
('b2', '\n        52\n    ')
('c2', '\n        <a3>\n            Abc\n        </a3>\n    ')
('a3', '\n            Abc\n        ')
('b1', '\n    21\n')
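
If the nesting level is also needed, the generator can carry a depth counter along. This is a sketch under the same assumptions as above (no attributes, no repeated tag names); the names walk and TAG_RE and the shortened sample are made up for illustration.

```python
import re

# Backreference pattern; [^<>]+ keeps the tag name from spanning several tags.
TAG_RE = re.compile(r'<([^<>]+)>([\s\S]*)</\1>')

def walk(text, depth=0):
    """Yield (depth, tag, body) for every element, outermost first."""
    for match in TAG_RE.finditer(text):
        tag, body = match.group(1), match.group(2)
        yield depth, tag, body
        # Recurse into the body to find nested elements one level deeper.
        yield from walk(body, depth + 1)

sample = ('<a1>\n    <a2>\n        1\n    </a2>\n    <c2>\n        <a3>\n'
          '            Abc\n        </a3>\n    </c2>\n</a1>\n<b1>\n    21\n</b1>')

outline = [(depth, tag) for depth, tag, _ in walk(sample)]
print(outline)
# [(0, 'a1'), (1, 'a2'), (1, 'c2'), (2, 'a3'), (0, 'b1')]
```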

Upvotes: 4
