Johsh Hanks

Reputation: 147

Parsing data nested between multiple brackets of different kinds

I am trying to parse data that is located within three types of brackets: {}, [], and <>.

The tricky part is that the data can be nested inside multiple brackets.

Easy case:

element-type { a | b | c | d} [ duration ]

This is a simple case, and I can just use a regex to grab the things between {} and [].

Here is a harder example:

tcp [ udp-mode { off | on { encrypted | plain }}]

What I essentially want to do here is extract "udp-mode", then extract its possible options and the options of those options. I am planning to store this in a tree structure, but I am struggling to figure out how to go about parsing it. The tricky part is the nested {}, as it could nest even further.

For curly brackets, I would want to grab everything before them, until I hit either another closing bracket or the start of the string.

For square brackets, I would want to grab everything inside them. But again, they can contain curly brackets or <>.

For <>, I would want to grab everything inside, and everything before, until I hit a bracket or the start of the string.

What is a good way to parse this kind of data?

Upvotes: 0

Views: 368

Answers (1)

PaulMcG

Reputation: 63709

Here is a go at your problem, using pyparsing:

from pyparsing import *

LBRACK,RBRACK,LBRACE,RBRACE,LANGLE,RANGLE,VERT_BAR = map(Suppress,"[]{}<>|")

expr = Forward()

ident = Word(alphas, alphanums+'-_')

optional_expr = (LBRACK + expr + RBRACK)
reqd_expr = (LBRACE + expr + RBRACE)
user_expr = (LANGLE + OneOrMore(ident) + RANGLE)

term = ident | optional_expr | reqd_expr | user_expr
term = Group(term * (2,None)) | term

expr <<= OneOrMore(term + ~VERT_BAR | Group(delimitedList(term,VERT_BAR)))


tests = """\
element-type { a | b | c | d} [ duration ]
tcp [ udp-mode { off | on { encrypted | plain }}]""".splitlines()

for t in tests:
    # print() calls so the loop runs on Python 3
    print(t)
    print(expr.parseString(t).asList()[0])
    print()

Prints:

element-type { a | b | c | d} [ duration ]
['element-type', ['a', 'b', 'c', 'd'], 'duration']

tcp [ udp-mode { off | on { encrypted | plain }}]
['tcp', ['udp-mode', ['off', ['on', ['encrypted', 'plain']]]]]

To see how the different groupings are interpreted, I add parse-time actions to decorate the returned groups:

def make_pa(prefix):
    def pa(tokens):
        return ParseResults([prefix] + [tokens])
    return pa

optional_expr = (LBRACK + expr + RBRACK).setParseAction(make_pa("OPT:"))
reqd_expr = (LBRACE + expr + RBRACE).setParseAction(make_pa("REQD:"))
user_expr = (LANGLE + OneOrMore(ident) + RANGLE).setParseAction(make_pa("USER:"))

term = ident | optional_expr | reqd_expr | user_expr
term = Group(term * (2,None)) | term
alternation = Group(term + OneOrMore(VERT_BAR + term))
alternation.setParseAction(make_pa("OR:"))

And your two test strings return the following nested lists:

element-type { a | b | c | d} [ duration ]
['element-type', 'REQD:', ['OR:', [['a', 'b', 'c', 'd']]], 'OPT:', ['duration']]

tcp [ udp-mode { off | on { encrypted | plain }}]
['tcp', 'OPT:', [['udp-mode', 'REQD:', ['OR:', [['off', ['on', 'REQD:', ['OR:', [['encrypted', 'plain']]]]]]]]]]
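If pulling in pyparsing is not an option, the same nesting can also be tracked with an explicit stack. The following is only a rough sketch using the standard library, not a translation of the pyparsing grammar above: the parse function, the labels OPT, REQD and USER, and the [label, alternative, ...] list layout are all assumptions made for illustration.

```python
import re

# Opening bracket -> (closing bracket, label). The labels are illustrative only.
BRACKETS = {"[": ("]", "OPT"), "{": ("}", "REQD"), "<": (">", "USER")}
CLOSERS = {close for close, _ in BRACKETS.values()}
# A token is a single bracket, a "|", or an identifier such as udp-mode.
# Any other character is silently skipped by findall in this sketch.
TOKEN_RE = re.compile(r"[\[\]{}<>|]|[A-Za-z][\w-]*")

def parse(text):
    """Build a nested list: every group is [label, alt1, alt2, ...],
    and every alternative is a list of identifiers and sub-groups."""
    root = ["ROOT", []]
    stack = [root]   # stack[-1] is the innermost group still open
    expected = []    # closing brackets still owed, innermost last
    for tok in TOKEN_RE.findall(text):
        if tok in BRACKETS:
            close, label = BRACKETS[tok]
            node = [label, []]
            stack[-1][-1].append(node)  # attach to the current alternative
            stack.append(node)
            expected.append(close)
        elif tok in CLOSERS:
            if not expected or expected.pop() != tok:
                raise ValueError(f"unbalanced {tok!r}")
            stack.pop()
        elif tok == "|":
            stack[-1].append([])        # start a new alternative
        else:
            stack[-1][-1].append(tok)
    if expected:
        raise ValueError("unclosed bracket")
    return root

print(parse("element-type { a | b | c | d} [ duration ]"))
# ['ROOT', ['element-type', ['REQD', ['a'], ['b'], ['c'], ['d']], ['OPT', ['duration']]]]
```

Because each group records its own alternatives, the options of "on" in the second test string end up as a REQD group nested inside the "on" alternative, which maps directly onto a tree.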

Upvotes: 1
