Reputation: 21

Python regexp matching or tokenizing

I have a dump of a data structure which i'm trying to convert into an XML. the structure has a number of nested structures within it. So i'm kind of lost on how to start because all the regex expressions that i can think of will not work on nested expressions.

For example, let's say there is a structure dump like this:

abc = (  
        bcd = (efg = 0, ghr = 5, lmn = 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))

and i want to come out with an output like this:

< abc >
  < bcd >   
    < efg >0< /efg >  
    < ghr >5< /ghr >  
    < lmn >10< /lmn >  
  < /bcd >  
.....  
< /abc >

So what would be a good approach to this? tokenizing the expression, a clever regex or using a stack?

Upvotes: 2

Answers (5)

Jonathan Eunice

Reputation: 22463

Here is an alternate answer that uses pyparsing more idiomatically. Because it provides a detailed grammar for what inputs may be seen and what results should be returned, parsed data is not "messy." Thus toXML() needn't work as hard nor do any real cleanup.

print "\n----- ORIGINAL -----\n"

dump = """
abc = (  
        bcd = (efg = 0, ghr = 5, lmn 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
""".strip()

print dump


print "\n----- PARSED INTO LIST -----\n"

from pyparsing import Word, alphas, nums, Optional, Forward, delimitedList, Group, Suppress

def Syntax():
    """Define grammar and parser."""

    # building blocks
    name   = Word(alphas)
    number = Word(nums)
    _equals = Optional(Suppress('='))
    _lpar   = Suppress('(')
    _rpar   = Suppress(')')

    # larger constructs
    expr = Forward()
    value = number | Group( _lpar + delimitedList(expr) + _rpar )
    expr << name + _equals + value

    return expr

parsed = Syntax().parseString(dump)
print parsed


print "\n----- SERIALIZED INTO XML ----\n"


def toXML(part, level=0):

    xml = ""
    indent = "    " * level
    while part:
        tag     = part.pop(0)
        payload = part.pop(0)

        insides = payload if isinstance(payload, str) \
                          else "\n" + toXML(payload, level+1) + indent

        xml += "{indent}<{tag}>{insides}</{tag}>\n".format(**locals())

    return xml

print toXML(parsed)

The input and XML output is the same as my other answer. The data returned by parseString() is the only real change:

----- PARSED INTO LIST -----

['abc', ['bcd', ['efg', '0', 'ghr', '5', 'lmn', '10'], 'ghd', '5', 'zde',
['dfs', '10', 'fge', '20', 'dfg', ['sdf', '3', 'ert', '5'], 'juh', '0']]]

Upvotes: 1

jfs

Reputation: 414265

You can use re module to parse nested expressions (though it is not recommended):

import re

def repl_flat(m):
    return "\n".join("<{0}>{1}</{0}>".format(*map(str.strip, s.partition('=')[::2]))
                     for s in m.group(1).split(','))

def eval_nested(expr):
    val, n = re.subn(r"\(([^)(]+)\)", repl_flat, expr)
    return val if n == 0 else eval_nested(val)

Example

print eval_nested("(%s)" % (data,))

Output

<abc><bcd><efg>0</efg>
<ghr>5</ghr>
<lmn>10</lmn></bcd>
<ghd>5</ghd>
<zde><dfs>10</dfs>
<fge>20</fge>
<dfg><sdf>3</sdf>
<ert>5</ert></dfg>
<juh>0</juh></zde></abc>

Upvotes: 0

Jonathan Eunice

Reputation: 22463

I like Igor Chubin's "use pyparsing" answer, because in general, regexps handle nested structures very poorly (though thg435's iterative replacement solution is a clever workaround).

But once pyparsing's done its thing, you then need a routine to walk the list and emit XML. It needs to be intelligent about the imperfections of pyparsing's results. For example, fge =20, doesn't yield the ['fge', '=', '20'] you'd like, but ['fge', '=20,']. Commas are sometimes also added in places that are unhelpful. Here's how I did it:

from pyparsing import nestedExpr

dump = """
abc = (  
        bcd = (efg = 0, ghr = 5, lmn 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
"""

dump = dump.strip()

print "\n----- ORIGINAL -----\n"
print dump

wrapped = dump if dump.startswith('(') else "({})".format(dump)
parsed = nestedExpr().parseString(wrapped).asList()

print "\n----- PARSED INTO LIST -----\n"
print parsed

def toXML(part, level=0):

    def grab_tag():
        return part.pop(0).lstrip(",")

    def grab_payload():
        payload = part.pop(0)
        if isinstance(payload, str):
            payload = payload.lstrip("=").rstrip(",")
        return payload

    xml = ""
    indent = "    " * level
    while part:
        tag     = grab_tag() or grab_tag()
        payload = grab_payload() or grab_payload()
        # grab twice, possibly, if '=' or ',' is in the way of what you're grabbing

        insides = payload if isinstance(payload, str) \
                          else "\n" + toXML(payload, level+1) + indent

        xml += "{indent}<{tag}>{insides}</{tag}>\n".format(**locals())

    return xml

print "\n----- SERIALIZED INTO XML ----\n"
print toXML(parsed[0])

Resulting in:

----- ORIGINAL -----

abc = (  
        bcd = (efg = 0, ghr = 5, lmn 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))

----- PARSED INTO LIST -----

[['abc', '=', ['bcd', '=', ['efg', '=', '0,', 'ghr', '=', '5,', 'lmn', '10'], ',', 'ghd', '=', '5,', 'zde', '=', ['dfs', '=', '10,', 'fge', '=20,', 'dfg', '=', ['sdf', '=', '3,', 'ert', '=', '5'], ',', 'juh', '=', '0']]]]

----- SERIALIZED INTO XML ----

<abc>
    <bcd>
        <efg>0</efg>
        <ghr>5</ghr>
        <lmn>10</lmn>
    </bcd>
    <ghd>5</ghd>
    <zde>
        <dfs>10</dfs>
        <fge>20</fge>
        <dfg>
            <sdf>3</sdf>
            <ert>5</ert>
        </dfg>
        <juh>0</juh>
    </zde>
</abc>

Upvotes: 0

georg

Reputation: 214969

I don't think regexps is the best approach here, but for those curious it can be done like this:

def expr(m):
    out = []
    for item in m.group(1).split(','):
        a, b = map(str.strip, item.split('='))
        out.append('<%s>%s</%s>' % (a, b, a))
    return '\n'.join(out)

rr = r'\(([^()]*)\)'
while re.search(rr, data):
    data = re.sub(rr, expr, data)

Basically, we repeatedly replace lowermost parenthesis (no parens here) with chunks of xml until there's no more parenthesis. For simplicity, I also included the main expression in parenthesis, if this is not the case, just do data='(%s)' % data before parsing.

Upvotes: 0

Igor Chubin

Reputation: 64563

Use pyparsing.

$ cat parsing.py 
from pyparsing import nestedExpr

abc = """(  
        bcd = (efg = 0, ghr = 5, lmn 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))"""
print nestedExpr().parseString(abc).asList()

$ python parsing.py
[['bcd', '=', ['efg', '=', '0,', 'ghr', '=', '5,', 'lmn', '10'], ',', 'ghd', '=', '5,', 'zde', '=', ['dfs', '=', '10,', 'fge', '=20,', 'dfg', '=', ['sdf', '=', '3,', 'ert', '=', '5'], ',', 'juh', '=', '0']]]

Upvotes: 3

Python regexp matching or tokenizing

Answers (5)

Example

Output

Related Questions