Reputation: 5068
Is there a way in regex to require that a string occurs twice within a given structure (e.g. the opening and closing tag names when parsing XML)? This code obviously does not work, because it matches from the first opening tag to the last closing tag:
re.findall(r'<(.+)>([\s\S]*)</(.+)>', s)
So is there a way to tell regex that the third group should match the same text as the first?
Full code:
import re
s = '''<a1>
<a2>
1
</a2>
<b2>
52
</b2>
<c2>
<a3>
Abc
</a3>
</c2>
</a1>
<b1>
21
</b1>'''
matches = re.findall(r'<(.+)>([\s\S]*)</(.+)>', s)
for match in matches:
    print(match)
Result should be all the tags with the contents:
[('a1', '\n <a2>\n 1\n </a2>\n <b2>\n 52\n </b2>\n <c2>\n <a3>\n Abc\n </a3>\n </c2>\n'),
('a2', '\n 1\n '),
...]
Note: I am not looking for a complete XML parsing package. The question is specifically about solving the given problem with regex.
Upvotes: 1
Views: 97
Reputation: 5068
Using the help danihp gave me in the answer and following the hint DDeMartini gave in the comment, I was able to create a lightweight XML parser that returns the XML as a nested dict:
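import re
def xml_loads(xml_text):
    # Each match is (tag name, inner text, repeated tag name)
    matches = re.findall(r'<([^<>]+)>([\s\S]*)</(\1)>', xml_text)
    if not matches:
        # No nested tags left: return the plain text content
        return xml_text.strip()
    d = {}
    for k, s2, _ in matches:
        d[k] = xml_loads(s2)
    return d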
s = '''<a1>
<a2>
1
</a2>
<b2>
52
</b2>
<c2>
<a3>
Abc
</a3>
</c2>
</a1>
<b1>
21
</b1>'''
d = xml_loads(s)
print(d)
Upvotes: 0
Reputation: 7166
I wouldn't do this, because recursive structures are difficult to parse with regexes. Python's re module doesn't support recursive patterns. The alternative regex module does. However, I still wouldn't do it.
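To illustrate why nesting is hard for a plain regex, here is a minimal sketch. The two input strings are made up for the demonstration and are not taken from the question. A greedy body merges siblings that share a name, while a lazy body closes an outer element at the first inner closing tag, so neither variant handles both cases.

```python
import re

# Greedy body: two siblings that share a name are merged into one match.
siblings = '<a>1</a><a>2</a>'
greedy = re.findall(r'<(\w+)>([\s\S]*)</\1>', siblings)

# Lazy body: the outer element is closed by the first inner closing tag.
nested = '<a><a>1</a></a>'
lazy = re.findall(r'<(\w+)>([\s\S]*?)</\1>', nested)

print(greedy)  # [('a', '1</a><a>2')]
print(lazy)    # [('a', '<a>1')]
```

The sample in the question avoids both problems only because no tag name is repeated.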
A backreference can only bring you this far:
import re
s = '''<a1>
<a2>
1
</a2>
<b2>
52
</b2>
<c2>
<a3>
Abc
</a3>
</c2>
</a1>
<b1>
21
</b1>'''
matches = re.findall(r'<(.+)>([\s\S]*)</\1>', s) # mind the \1
for match in matches:
    print(match)
It will give you two matches: <a1> ... </a1> and <b1> ... </b1>.
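As a small illustration of what the backreference changes, the sketch below compares a pattern with an independent third group against one that uses \1, plus the equivalent named-group form. The input string with the mismatched closing tag is invented for the example.

```python
import re

# '<a>' is closed by '</b>' on purpose, '<c>' is closed correctly.
text = '<a>1</b><c>2</c>'

# Without a backreference, any closing tag name is accepted.
loose = re.findall(r'<(\w+)>(.*?)</(\w+)>', text)

# With \1, the closing name must repeat whatever group 1 captured.
strict = re.findall(r'<(\w+)>(.*?)</\1>', text)

# The same constraint written with a named group and (?P=tag).
named = re.findall(r'<(?P<tag>\w+)>(.*?)</(?P=tag)>', text)

print(loose)   # [('a', '1', 'b'), ('c', '2', 'c')]
print(strict)  # [('c', '2')]
print(named)   # [('c', '2')]
```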
Now imagine that some of your tags have attributes. What if a tag spans more than one line? What about self-closing tags? What about accidental spaces?
An HTML/XML parser can deal with all of this.
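For comparison, here is a rough sketch using xml.etree.ElementTree from the standard library. The fragment, the wrapping root element and the helper name to_pairs are assumptions made for this example. The wrapper is needed because the parser expects a single top-level element, and the sample in the question has two.

```python
import xml.etree.ElementTree as ET

# Made-up fragment with an attribute, a self-closing tag and a stray space.
fragment = '''<a1 id="x">
  <a2>1</a2>
  <empty/>
  <c2 ><a3>Abc</a3></c2>
</a1>
<b1>21</b1>'''

# Wrap the fragment in a hypothetical root so it becomes one document.
root = ET.fromstring(f'<root>{fragment}</root>')

def to_pairs(element):
    """Yield (tag, stripped text, attributes) for every descendant, depth first."""
    for child in element:
        yield child.tag, (child.text or '').strip(), dict(child.attrib)
        yield from to_pairs(child)

pairs = list(to_pairs(root))
for pair in pairs:
    print(pair)
```

Unlike the regex versions, this yields only the direct text of each element, not the raw markup of its children.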
Upvotes: 1
Reputation: 51645
You can use backreferences and simple recursion:
>>> def m(s):
...     matches = re.findall(r'<(.+)>([\s\S]*)</(\1)>', s)
...     for k, s2, _ in matches:
...         print((k, s2))
...         m(s2)
...
>>> m(s)
('a1', '\n <a2>\n ...[dropped]... </a3>\n </c2>\n')
('a2', '\n 1\n ')
('b2', '\n 52\n ')
('c2', '\n <a3>\n Abc\n </a3>\n ')
('a3', '\n Abc\n ')
('b1', '\n 21\n')
More about backreferences from Microsoft Docs.
Edited
For extra fun, with a generator. Thanks @mrCarnivore for your suggestion to remove the if s check:
>>> def m(s):
...     matches = re.findall(r'<(.+)>([\s\S]*)</(\1)>', s)
...     for k, s2, _ in matches:
...         yield (k, s2)
...         yield from m(s2)
...
>>> for x in m(s):
...     x
...
('a1', '\n <a2>\ [....] Abc\n </a3>\n </c2>\n')
('a2', '\n 1\n ')
('b2', '\n 52\n ')
('c2', '\n <a3>\n Abc\n </a3>\n ')
('a3', '\n Abc\n ')
('b1', '\n 21\n')
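A possible variation on the generator, sketched under a few assumptions: the names TAG_PATTERN and walk and the compact sample string are made up here. It leaves the backreference uncaptured so findall returns 2-tuples, and it also yields the nesting depth. It inherits the limits of the greedy pattern, so it is only reliable when no tag name is repeated.

```python
import re

# \1 is not wrapped in a group, so findall returns (name, body) pairs.
TAG_PATTERN = re.compile(r'<([^<>]+)>([\s\S]*)</\1>')

def walk(text, depth=0):
    """Yield (depth, tag name, stripped body) for each element, depth first."""
    for name, body in TAG_PATTERN.findall(text):
        yield depth, name, body.strip()
        yield from walk(body, depth + 1)

sample = '<a1><a2>1</a2><b2>52</b2></a1><b1>21</b1>'
result = list(walk(sample))
for depth, name, body in result:
    print('  ' * depth + name, repr(body))
```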
Upvotes: 4