Sid Shukla
Sid Shukla

Reputation: 1030

Python Regex for parsing a tree structure defined in a given BNF like format

I have a a txt file representing a tree that is defined in the following BNF like format:

branches : '(' <value1> <value2> <n_children> <branches> ')'
         | ''

Each node for the tree has two values value1 and value2 that are integers, number of children that is an integer, and branches. an example of a subset of the file looks like:

(1796161 2205411 3  
    (1796288 2205425 0  )
    (1811141 2205419 1  
        (1811652 2205480 1  
            (1812161 2205496 4  
                (1812288 2205521 1  
                    (1812415 2205526 0  ))
                (1812034 2205516 0  )
                (1827651 2205510 2  
                    (1827906 2205581 2  
                        (1843777 2205588 2  
                            (1844032 2205626 1  
                                (1844159 2205632 0  ))
                            (1843138 2205617 0  ))
                        (1828288 2205591 1  
                            (1828161 2205597 0  )))
                    (1827012 2205563 0  ))
                (1811907 2205511 0  ))))
    (1796034 2205420 0  ))

Is there a nice way to parse this data using regular expression(s) that wouldn't involve me manually reading the file character by character keeping track of all the parantheses and preserves the relationship (parent-child) ordering.

Upvotes: 1

Views: 775

Answers (1)

ThisSuitIsBlackNot
ThisSuitIsBlackNot

Reputation: 24083

This is easiest to do with a parser that lets you write your own grammar, like pyPEG. The following creates a tree of Node objects, where each Node can have zero or more children:

import re

from pypeg2 import attr, ignore, List, maybe_some, parse

Int = re.compile(r'[0-9]+')  # a positive integer

class Values(List):
    '''A pair of values associated with each node in the tree.

    For example, in the node

        ( 1 2 0 )

    the values are 1 and 2 (and 0 is the number of children).
    '''

    grammar = 2, Int

class Node(List):
    '''A node in the tree.

    Attributes:
        values      The pair of values associated with this node
        children    A list of child nodes
    '''

    def __repr__(self):
        return 'Values: ' + ', '.join(self.values)

# Grammar for Node is recursive (a Node can contain other Nodes),
# so we have to set it outside of the Node class definition.
Node.grammar = (
    '(',
    attr('values', Values),
    ignore(Int),
    attr('children', maybe_some(Node)),
    ')'
)

def print_tree(root, indent=0):
    '''Print a tree of Nodes recursively'''

    print(' ' * indent, end='')
    print(root)

    if len(root.children) > 0:
        for child in root.children:
            print_tree(child, indent + 2)


if __name__ == '__main__':

    tree = '''
        (1796161 2205411 3  
            (1796288 2205425 0  )
            (1811141 2205419 1  
                (1811652 2205480 1  
                    (1812161 2205496 4  
                        (1812288 2205521 1  
                            (1812415 2205526 0  ))
                        (1812034 2205516 0  )
                        (1827651 2205510 2  
                            (1827906 2205581 2  
                                (1843777 2205588 2  
                                    (1844032 2205626 1  
                                        (1844159 2205632 0  ))
                                    (1843138 2205617 0  ))
                                (1828288 2205591 1  
                                    (1828161 2205597 0  )))
                            (1827012 2205563 0  ))
                        (1811907 2205511 0  ))))
            (1796034 2205420 0  ))
    '''

    root = parse(tree, Node)
    print_tree(root)

Output:

Values: 1796161, 2205411
  Values: 1796288, 2205425
  Values: 1811141, 2205419
    Values: 1811652, 2205480
      Values: 1812161, 2205496
        Values: 1812288, 2205521
          Values: 1812415, 2205526
        Values: 1812034, 2205516
        Values: 1827651, 2205510
          Values: 1827906, 2205581
            Values: 1843777, 2205588
              Values: 1844032, 2205626
                Values: 1844159, 2205632
              Values: 1843138, 2205617
            Values: 1828288, 2205591
              Values: 1828161, 2205597
          Values: 1827012, 2205563
        Values: 1811907, 2205511
  Values: 1796034, 2205420

Upvotes: 3

Related Questions