Reputation: 21
I have a dump of a data structure that I'm trying to convert into XML. The structure contains a number of nested structures, so I'm not sure how to start: every regex I can think of fails on nested expressions.
For example, let's say there is a structure dump like this:
abc = (
bcd = (efg = 0, ghr = 5, lmn = 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
and I want to end up with output like this:
<abc>
<bcd>
<efg>0</efg>
<ghr>5</ghr>
<lmn>10</lmn>
</bcd>
.....
</abc>
So what would be a good approach to this? Tokenizing the expression, a clever regex, or using a stack?
Upvotes: 2
Views: 264
Reputation: 22463
Here is an alternate answer that uses pyparsing more idiomatically. Because it provides a detailed grammar for what inputs may be seen and what results should be returned, parsed data is not "messy." Thus toXML() needn't work as hard nor do any real cleanup.
print("\n----- ORIGINAL -----\n")
dump = """
abc = (
bcd = (efg = 0, ghr = 5, lmn 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
""".strip()
print(dump)
print("\n----- PARSED INTO LIST -----\n")
from pyparsing import Word, alphas, nums, Optional, Forward, delimitedList, Group, Suppress
def Syntax():
    """Define grammar and parser."""
    # building blocks
    name = Word(alphas)
    number = Word(nums)
    _equals = Optional(Suppress('='))
    _lpar = Suppress('(')
    _rpar = Suppress(')')
    # larger constructs
    expr = Forward()
    value = number | Group( _lpar + delimitedList(expr) + _rpar )
    expr << name + _equals + value
    return expr
parsed = Syntax().parseString(dump)
print(parsed)
print("\n----- SERIALIZED INTO XML ----\n")
def toXML(part, level=0):
    xml = ""
    indent = " " * level
    while part:
        tag = part.pop(0)
        payload = part.pop(0)
        insides = payload if isinstance(payload, str) \
                  else "\n" + toXML(payload, level+1) + indent
        xml += "{indent}<{tag}>{insides}</{tag}>\n".format(**locals())
    return xml
print(toXML(parsed))
The input and XML output are the same as in my other answer. The data returned by parseString() is the only real change:
----- PARSED INTO LIST -----
['abc', ['bcd', ['efg', '0', 'ghr', '5', 'lmn', '10'], 'ghd', '5', 'zde',
['dfs', '10', 'fge', '20', 'dfg', ['sdf', '3', 'ert', '5'], 'juh', '0']]]
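Since the parsed data is now a clean alternation of names and values, it could also be handed to the standard library's xml.etree.ElementTree instead of building strings by hand. The following is only a sketch under some assumptions: it targets Python 3, it expects a plain nested list (so pyparsing results would first need .asList()), and the helper names build and list_to_xml are made up for illustration.

```python
import xml.etree.ElementTree as ET

def build(parent, pairs):
    """Append one child element per (name, value) pair of a flat list."""
    for name, value in zip(pairs[0::2], pairs[1::2]):
        child = ET.SubElement(parent, name)
        if isinstance(value, list):
            build(child, value)   # nested group: recurse into it
        else:
            child.text = value    # leaf: the number, kept as a string

def list_to_xml(parsed):
    """Convert ['tag', [name, value, ...]] into an XML string."""
    tag, payload = parsed
    root = ET.Element(tag)
    build(root, payload)
    return ET.tostring(root, encoding="unicode")

# A shortened version of the parsed list shown above.
parsed = ['abc', ['bcd', ['efg', '0', 'ghr', '5', 'lmn', '10'], 'ghd', '5']]
print(list_to_xml(parsed))
# <abc><bcd><efg>0</efg><ghr>5</ghr><lmn>10</lmn></bcd><ghd>5</ghd></abc>
```

Unlike toXML(), this leaves the input list intact and lets ElementTree take care of escaping, at the cost of unindented output.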
Upvotes: 1
Reputation: 414265
You can use the re module to parse nested expressions (though it is not recommended):
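import re
def repl_flat(m):
    # turn one flat "a = 1, b = 2" group into "<a>1</a>\n<b>2</b>"
    return "\n".join("<{0}>{1}</{0}>".format(*map(str.strip, s.partition('=')[::2]))
                     for s in m.group(1).split(','))
def eval_nested(expr):
    # replace the innermost parenthesized groups, then recurse until none are left
    val, n = re.subn(r"\(([^)(]+)\)", repl_flat, expr)
    return val if n == 0 else eval_nested(val)
# data is the dump text from the question
print(eval_nested("(%s)" % (data,)))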
<abc><bcd><efg>0</efg>
<ghr>5</ghr>
<lmn>10</lmn></bcd>
<ghd>5</ghd>
<zde><dfs>10</dfs>
<fge>20</fge>
<dfg><sdf>3</sdf>
<ert>5</ert></dfg>
<juh>0</juh></zde></abc>
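If you would rather avoid regex substitution for the nesting itself, the tokenize-plus-stack idea from the question can be sketched with the standard library alone. This is an illustration, not a hardened parser: the names TOKEN, tokenize and dump_to_xml are hypothetical, names are assumed to be identifiers, values are assumed to be unsigned integers, and '=' and ',' are simply skipped.

```python
import re

# One token: a name, a number, or one of the punctuation marks = ( ) ,
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[=(),])")

def tokenize(text):
    """Split the dump into a flat list of tokens."""
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def dump_to_xml(text):
    """Emit XML while tracking the currently open elements on a stack."""
    out = []      # pieces of the XML output
    stack = []    # names of elements opened but not yet closed
    name = None   # most recent name still waiting for its value
    for tok in tokenize(text):
        if tok == "(":
            if name is None:
                raise ValueError("'(' without a preceding name")
            out.append("<%s>" % name)
            stack.append(name)
            name = None
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            out.append("</%s>" % stack.pop())
        elif tok in ("=", ","):
            continue              # separators carry no information here
        elif name is None:
            name = tok            # first token of a pair is the name
        else:
            out.append("<%s>%s</%s>" % (name, tok, name))
            name = None
    if stack:
        raise ValueError("unbalanced '('")
    return "".join(out)

dump = """
abc = (
bcd = (efg = 0, ghr = 5, lmn = 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
"""
print(dump_to_xml(dump))
```

The stack only ever holds the names of open elements, so each closing parenthesis knows which tag to close without rescanning the input.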
Upvotes: 0
Reputation: 22463
I like Igor Chubin's "use pyparsing" answer, because in general, regexps handle nested structures very poorly (though thg435's iterative replacement solution is a clever workaround).
But once pyparsing's done its thing, you then need a routine to walk the list and emit XML. It needs to be intelligent about the imperfections of pyparsing's results. For example, fge =20, doesn't yield the ['fge', '=', '20'] you'd like, but ['fge', '=20,']. Commas are sometimes also added in places that are unhelpful. Here's how I did it:
from pyparsing import nestedExpr
dump = """
abc = (
bcd = (efg = 0, ghr = 5, lmn 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
"""
dump = dump.strip()
print("\n----- ORIGINAL -----\n")
print(dump)
wrapped = dump if dump.startswith('(') else "({})".format(dump)
parsed = nestedExpr().parseString(wrapped).asList()
print("\n----- PARSED INTO LIST -----\n")
print(parsed)
def toXML(part, level=0):
    def grab_tag():
        return part.pop(0).lstrip(",")
    def grab_payload():
        payload = part.pop(0)
        if isinstance(payload, str):
            payload = payload.lstrip("=").rstrip(",")
        return payload
    xml = ""
    indent = " " * level
    while part:
        tag = grab_tag() or grab_tag()
        payload = grab_payload() or grab_payload()
        # grab twice, possibly, if '=' or ',' is in the way of what you're grabbing
        insides = payload if isinstance(payload, str) \
                  else "\n" + toXML(payload, level+1) + indent
        xml += "{indent}<{tag}>{insides}</{tag}>\n".format(**locals())
    return xml
print("\n----- SERIALIZED INTO XML ----\n")
print(toXML(parsed[0]))
Resulting in:
----- ORIGINAL -----
abc = (
bcd = (efg = 0, ghr = 5, lmn 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
----- PARSED INTO LIST -----
[['abc', '=', ['bcd', '=', ['efg', '=', '0,', 'ghr', '=', '5,', 'lmn', '10'], ',', 'ghd', '=', '5,', 'zde', '=', ['dfs', '=', '10,', 'fge', '=20,', 'dfg', '=', ['sdf', '=', '3,', 'ert', '=', '5'], ',', 'juh', '=', '0']]]]
----- SERIALIZED INTO XML ----
<abc>
<bcd>
<efg>0</efg>
<ghr>5</ghr>
<lmn>10</lmn>
</bcd>
<ghd>5</ghd>
<zde>
<dfs>10</dfs>
<fge>20</fge>
<dfg>
<sdf>3</sdf>
<ert>5</ert>
</dfg>
<juh>0</juh>
</zde>
</abc>
Upvotes: 0
Reputation: 214969
I don't think regexps are the best approach here, but for those who are curious, it can be done like this:
import re
def expr(m):
    # convert one flat "a = 1, b = 2" group into XML elements
    out = []
    for item in m.group(1).split(','):
        a, b = map(str.strip, item.split('='))
        out.append('<%s>%s</%s>' % (a, b, a))
    return '\n'.join(out)
# data is the dump text, wrapped in parentheses
rr = r'\(([^()]*)\)'
while re.search(rr, data):
    data = re.sub(rr, expr, data)
Basically, we repeatedly replace the innermost parentheses (those with no parens inside them) with chunks of XML until there are no parentheses left. For simplicity, I also wrapped the main expression in parentheses; if yours is not wrapped, just do data='(%s)' % data before parsing.
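To see what each pass does, here is a small self-contained trace of the same loop. It is a sketch rather than the code above: the names INNERMOST, to_tags and trace_passes are made up, partition is used in place of split('='), and the pieces are joined without newlines to keep the trace on one line per pass.

```python
import re

# A parenthesized group that contains no further parentheses.
INNERMOST = re.compile(r'\(([^()]*)\)')

def to_tags(match):
    """Turn 'a = 1, b = 2' into '<a>1</a><b>2</b>'."""
    out = []
    for item in match.group(1).split(','):
        name, _, value = item.partition('=')
        name, value = name.strip(), value.strip()
        out.append('<%s>%s</%s>' % (name, value, name))
    return ''.join(out)

def trace_passes(data):
    """Return the string after every substitution pass, input first."""
    steps = [data]
    while INNERMOST.search(data):
        data = INNERMOST.sub(to_tags, data)
        steps.append(data)
    return steps

for step in trace_passes('(a = (b = 1, c = (d = 2)), e = 3)'):
    print(step)
# (a = (b = 1, c = (d = 2)), e = 3)
# (a = (b = 1, c = <d>2</d>), e = 3)
# (a = <b>1</b><c><d>2</d></c>, e = 3)
# <a><b>1</b><c><d>2</d></c></a><e>3</e>
```

Each pass removes one level of nesting, so the number of passes equals the depth of the deepest group.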
Upvotes: 0
Reputation: 64563
Use pyparsing.
$ cat parsing.py
from pyparsing import nestedExpr
abc = """(
bcd = (efg = 0, ghr = 5, lmn 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))"""
print(nestedExpr().parseString(abc).asList())
$ python parsing.py
[['bcd', '=', ['efg', '=', '0,', 'ghr', '=', '5,', 'lmn', '10'], ',', 'ghd', '=', '5,', 'zde', '=', ['dfs', '=', '10,', 'fge', '=20,', 'dfg', '=', ['sdf', '=', '3,', 'ert', '=', '5'], ',', 'juh', '=', '0']]]
Upvotes: 3