Reputation: 1542
So I have a string like this:
A!
B!
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>
Is it possible to parse this with a regex so I get this output (a list):
[A, B, C, [D, E], F, [G]]
Basically, I'm looking for a way to split the string by ! and by the tag. The tag part can appear anywhere and multiple times, but never recursively (a tag within a tag does not happen). The whole thing seems regular, so is this even possible to do with a regex?
EDIT: I am using Python
EDIT2: A, B, C... are only placeholders; the items can be any strings made of letters and numbers.
Upvotes: 2
Views: 192
Reputation: 27575
If the conditions are as I understand them (for example, no character ever precedes '<tag>' or '</tag>' on its line, right?), I think the following code does the job:
import re
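# Each alternative is exactly one capture group, so mat.lastindex
# tells which alternative matched.
RE = (r'(\A\n*<tag>\n+)',
      r'(\A\n*)',
      r'(!\n*</tag>(?!\n*\Z)\n*)',
      r'(!\n*</tag>\n*\Z)',
      r'(!\n*<tag>\n+)',
      r'(!\n*\Z)',
      r'(!\n+)')
pat = re.compile('|'.join(RE))
def repl(mat, d = {1:"[['", 2:"['", 3:"'],'", 4:"']]", 5:"',['", 6:"']", 7:"','"}):
    # Replace each delimiter with the matching piece of list-literal syntax.
    return d[mat.lastindex]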
ch = .... # a string to parse
dh = eval(pat.sub(repl,ch))
applying:
ch1 = '''
A!
B!
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>
'''
ch2 = '''A!
B!
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>
H!
'''
ch3 = '''
A!
B!
C!
<tag>
D!
E!
</tag>
Fududu!gutuyu!!
<tag>
G!
</tag>
H!'''
ch4 = '''<tag>
A!
B!
</tag>
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>
H!'''
import re
RE = (r'(\A\n*<tag>\n+)',
      r'(\A\n*)',
      r'(!\n*</tag>(?!\n*\Z)\n*)',
      r'(!\n*</tag>\n*\Z)',
      r'(!\n*<tag>\n+)',
      r'(!\n*\Z)',
      r'(!\n+)')
pat = re.compile('|'.join(RE))
def repl(mat, d = {1:"[['", 2:"['", 3:"'],'", 4:"']]", 5:"',['", 6:"']", 7:"','"}):
    return d[mat.lastindex]
for ch in (ch1,ch2,ch3,ch4):
    print ch
    dh = eval(pat.sub(repl,ch))
    print dh,'\n',type(dh)
    print '\n\n============================='
result
>>>
A!
B!
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>
['A', 'B', 'C', ['D', 'E'], 'F', ['G']]
<type 'list'>
=============================
A!
B!
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>
H!
['A', 'B', 'C', ['D', 'E'], 'F', ['G'], 'H']
<type 'list'>
=============================
A!
B!
C!
<tag>
D!
E!
</tag>
Fududu!gutuyu!!
<tag>
G!
</tag>
H!
['A', 'B', 'C', ['D', 'E'], 'Fududu!gutuyu!', ['G'], 'H']
<type 'list'>
=============================
<tag>
A!
B!
</tag>
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>
H!
[['A', 'B'], 'C', ['D', 'E'], 'F', ['G'], 'H']
<type 'list'>
=============================
>>>
Upvotes: 0
Reputation: 7103
Here is my solution to the problem. It uses regexp and some operations on the list.
import re
my_str = "A!\nB!\n<tag>\nC!\n</tag>\nD!\nE!\n<tag>\nF!\nG!\n</tag>\nH!\n"
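# Text outside the tags: before the first tag, between tags, after the last tag.
x = (re.findall("^(?:.|\n)+?(?=\n<tag>)", my_str)
     + re.findall("(?<=</tag>\n)(?:.|\n)+?(?=\n<tag>\n)", my_str)
     + re.findall("(?<=>\n)(?:[^>]|\n)+(?=\n)$", my_str))
y = []
for elem in x:
    y += elem.split('\n')
# Text inside each <tag>...</tag> pair becomes a nested list.
x = re.findall("((?<=<tag>\n)(?:.|\n)+?(?=\n</tag>\n))", my_str)
for elem in x:
    y.append(elem.split('\n'))
print y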
it produces the output
['A!', 'B!', 'D!', 'E!', 'H!', ['C!'], ['F!', 'G!']]
I didn't have much time to test it, though.
I do not think there is an easier way of doing this, as there is no recursive regexp in Python, see SO thread.
Have a good night (my time-zone). ;)
Note: it could probably be made nicer by putting everything into one regexp using xor (see XOR in Regexp), but I think it would lose readability.
Upvotes: 0
Reputation: 50995
from collections import deque
from types import StringTypes
s = "A!\nB!\nC!\n<tag>\nD!\nE!\n</tag>\nF!\n<tag>\nG!\n</tag>"
def parse(parts):
    # Accept either the raw string or an already-split deque of lines.
    if type(parts) in StringTypes:
        parts = deque(parts.split("\n"))
    ret = []
    while parts:
        part = parts.popleft()
        if part[-1] == "!":
            ret.append(part[:-1])
        elif part == "<tag>":
            # Recurse: the nested call consumes lines up to the closing tag.
            ret.append(parse(parts))
        elif part == "</tag>":
            return ret
    return ret
print parse(s)
I use a deque for speed because pop(0) would be very slow, and reversing the list and using pop() would make the function harder to read and understand.
I dare anyone to create a regexp doing the same, while also improving clarity!
(BTW, I think you could also use the pyparsing module to solve this problem, since it supports recursion.)
EDIT: Changed function to expect either string or deque as argument, simplifying invocation.
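For readers on Python 3, where `types.StringTypes` and the print statement no longer exist, here is a rough port of the same line-by-line idea. It is only a sketch: the name `parse_lines` is made up for illustration, and `endswith` is used so that empty lines are skipped instead of raising an error.

```python
from collections import deque

def parse_lines(parts):
    # Accept either the raw string or an already-split deque of lines.
    if isinstance(parts, str):
        parts = deque(parts.split("\n"))
    ret = []
    while parts:
        part = parts.popleft()
        if part.endswith("!"):
            ret.append(part[:-1])            # plain item: drop the "!"
        elif part == "<tag>":
            ret.append(parse_lines(parts))   # nested call eats lines up to </tag>
        elif part == "</tag>":
            return ret
    return ret

s = "A!\nB!\nC!\n<tag>\nD!\nE!\n</tag>\nF!\n<tag>\nG!\n</tag>"
print(parse_lines(s))
```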
Upvotes: 0
Reputation: 7103
Wouldn't it be easier to just replace <tag> with [ and </tag> with ], replace !\n with a comma, and at the end enclose everything in one more pair of []?
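A rough Python 3 sketch of that replacement idea, assuming (as the question says) that items consist only of letters and digits. The function name `to_nested_list` is made up for illustration, the items are additionally wrapped in quotes so the result parses as a list of strings, and `ast.literal_eval` is used rather than `eval` so that only literals are accepted:

```python
import ast
import re

def to_nested_list(text):
    # Turn every "item!" into a quoted string literal followed by a comma.
    out = re.sub(r"([A-Za-z0-9]+)!", r"'\1',", text)
    # Turn the tag markers into list brackets (closing tag first).
    out = out.replace("</tag>", "],").replace("<tag>", "[")
    # Wrap everything in an outer list and parse it as a Python literal.
    return ast.literal_eval("[" + out + "]")

sample = "A!\nB!\nC!\n<tag>\nD!\nE!\n</tag>\nF!\n<tag>\nG!\n</tag>\n"
print(to_nested_list(sample))
```

Newlines and trailing commas inside the brackets are harmless to the literal parser, which is why no extra cleanup is needed here.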
Upvotes: 1
Reputation: 30760
I don't know Python, but you can do this with three simple regex-replaces (possibly doable as a single regex, but the following should work fine).
Javascript version:
str = '[' + str.replace(/!\n/g, ', ').replace(/<[^\/>]*>/g, '[').replace(/<\/[^>]*>/g, ']') + ']';
Hopefully that will be understandable enough to translate to Python.
Edit: Are you looking for array output? I thought your sample output was a literal string, but now I think that was meant to represent a nested array.
Upvotes: 1
Reputation: 168685
Yes it is possible.
To generate a flat array, your regex would be quite hairy, involving backtracking. It would be very similar to a regex for splitting a CSV file while allowing quoted strings, where the <tag> / </tag> markers take the place of the quote marks, and the ! takes the place of the comma.
But you asked for a nested array structure, and in fact that makes things easier.
In order to get the nested array structure, you're going to need to do two separate split operations, which means doing two separate regex operations. You could do the first one as described above, but in fact, having to do two separate operations actually makes it easier for you because you can split out the sections embedded in the <tag>s in the first pass, and since you say there are no nested tags, that means you don't need to do any complex regex backtracking.
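The two passes could look roughly like this in Python 3. This is only a sketch under the question's assumptions (no nested tags, items made of letters and digits); the names `split_items` and `parse_two_pass` are made up for illustration.

```python
import re

def split_items(chunk):
    # Second pass: every run of letters/digits followed by "!" is one item.
    return re.findall(r"([A-Za-z0-9]+)!", chunk)

def parse_two_pass(text):
    # First pass: split on whole <tag>...</tag> sections. Because the
    # pattern has one capturing group, re.split keeps the tag bodies,
    # and they land at the odd indexes of the result.
    pieces = re.split(r"<tag>(.*?)</tag>", text, flags=re.DOTALL)
    result = []
    for index, piece in enumerate(pieces):
        if index % 2:
            result.append(split_items(piece))   # inside a tag: nested list
        else:
            result.extend(split_items(piece))   # outside a tag: flat items
    return result

sample = "A!\nB!\nC!\n<tag>\nD!\nE!\n</tag>\nF!\n<tag>\nG!\n</tag>"
print(parse_two_pass(sample))
```

Since tags never nest, the non-greedy `.*?` always stops at the right closing tag, so neither pass needs any backtracking tricks.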
Hope that helps.
Upvotes: 0