Reputation: 1030
I have a a txt file representing a tree that is defined in the following BNF like format:
branches : '(' <value1> <value2> <n_children> <branches> ')'
| ''
Each node for the tree has two values value1
and value2
that are integers, number of children that is an integer, and branches.
an example of a subset of the file looks like:
(1796161 2205411 3
(1796288 2205425 0 )
(1811141 2205419 1
(1811652 2205480 1
(1812161 2205496 4
(1812288 2205521 1
(1812415 2205526 0 ))
(1812034 2205516 0 )
(1827651 2205510 2
(1827906 2205581 2
(1843777 2205588 2
(1844032 2205626 1
(1844159 2205632 0 ))
(1843138 2205617 0 ))
(1828288 2205591 1
(1828161 2205597 0 )))
(1827012 2205563 0 ))
(1811907 2205511 0 ))))
(1796034 2205420 0 ))
Is there a nice way to parse this data using regular expression(s) that wouldn't involve me manually reading the file character by character keeping track of all the parantheses and preserves the relationship (parent-child) ordering.
Upvotes: 1
Views: 775
Reputation: 24083
This is easiest to do with a parser that lets you write your own grammar, like pyPEG. The following creates a tree of Node
objects, where each Node
can have zero or more children:
import re
from pypeg2 import attr, ignore, List, maybe_some, parse
Int = re.compile(r'[0-9]+') # a positive integer
class Values(List):
'''A pair of values associated with each node in the tree.
For example, in the node
( 1 2 0 )
the values are 1 and 2 (and 0 is the number of children).
'''
grammar = 2, Int
class Node(List):
'''A node in the tree.
Attributes:
values The pair of values associated with this node
children A list of child nodes
'''
def __repr__(self):
return 'Values: ' + ', '.join(self.values)
# Grammar for Node is recursive (a Node can contain other Nodes),
# so we have to set it outside of the Node class definition.
Node.grammar = (
'(',
attr('values', Values),
ignore(Int),
attr('children', maybe_some(Node)),
')'
)
def print_tree(root, indent=0):
'''Print a tree of Nodes recursively'''
print(' ' * indent, end='')
print(root)
if len(root.children) > 0:
for child in root.children:
print_tree(child, indent + 2)
if __name__ == '__main__':
tree = '''
(1796161 2205411 3
(1796288 2205425 0 )
(1811141 2205419 1
(1811652 2205480 1
(1812161 2205496 4
(1812288 2205521 1
(1812415 2205526 0 ))
(1812034 2205516 0 )
(1827651 2205510 2
(1827906 2205581 2
(1843777 2205588 2
(1844032 2205626 1
(1844159 2205632 0 ))
(1843138 2205617 0 ))
(1828288 2205591 1
(1828161 2205597 0 )))
(1827012 2205563 0 ))
(1811907 2205511 0 ))))
(1796034 2205420 0 ))
'''
root = parse(tree, Node)
print_tree(root)
Output:
Values: 1796161, 2205411
Values: 1796288, 2205425
Values: 1811141, 2205419
Values: 1811652, 2205480
Values: 1812161, 2205496
Values: 1812288, 2205521
Values: 1812415, 2205526
Values: 1812034, 2205516
Values: 1827651, 2205510
Values: 1827906, 2205581
Values: 1843777, 2205588
Values: 1844032, 2205626
Values: 1844159, 2205632
Values: 1843138, 2205617
Values: 1828288, 2205591
Values: 1828161, 2205597
Values: 1827012, 2205563
Values: 1811907, 2205511
Values: 1796034, 2205420
Upvotes: 3