Dominik Schmidt
Dominik Schmidt

Reputation: 557

Python: Parse text file and create tree structure

What's the best way to parse a plain text tree structure like this:

node1:
    node1
    node2:
        node1
node2:
    node1
    node2
    node3:
        node1:
            node1
        node2:
            node1
            node2

and convert it into a tree structure (with lists or dictionaries)?

Is there any python library to help me with the parsing?

Upvotes: 2

Views: 2688

Answers (2)

AChampion
AChampion

Reputation: 30288

You could construct a simple parser that generates a valid python expression from the input and then evaluate it. My initial thought had been a simple recursive parser, but that's more difficult than I anticipated because their is no way to know the block is ending without a peak ahead - a common problem with indent based formats.

This generates nested list of tuples (block_name, [contents]):

i = 0
r = '['
for l in mystring.split('\n'):
    if not l:
        continue
    cl = l.lstrip(' ')
    ci = (len(l) - len(cl))//4
    if ci > i:           # line indented
        r += '['
    elif ci < i:         # line unindented, can be multiple
        r += '])'*(i-ci) + ','
    if cl[-1] == ':':    # new block
        r += '{("{}":'.format(cl[:-1])
    else:                # new item
        r += '"{}",'.format(cl)
    i = ci
r += ']'+')]'*i
eval(r)

Output:

[('node1', ['node1', ('node2', ['node1'])]),
 ('node2',
  ['node1',
   'node2',
   ('node3', [('node1', ['node1']), ('node2', ['node1', 'node2'])])])]

Upvotes: 1

Patrick Maupin
Patrick Maupin

Reputation: 8137

The rson library will do this, except you'll probably have to subclass the parser to allow the mixing of array and dict style elements in a single structure.

Edit Actually, that might be a bit difficult, but the rsonlite package will (sort of) work as-is with your data.

rsonlite is a small, single-module package that is only 300 lines long, and the same source works with both Python 2 and Python 3.

Here is an example that shows 3 different outputs from your data. The first output is what rsonlite.dumps() gives, the second output is what the slightly higher-level rsonlite.simpleparse() gives, and the third output takes the results from simpleparse and runs them through a custom fixup() function to create a pure nested dictionary data structure, where any missing value is simply set to None, and all the colon characters are checked and stripped.

from rsonlite import loads, simpleparse


mystring = '''
node1:
    node1
    node2:
        node1
node2:
    node1
    node2
    node3:
        node1:
            node1
        node2:
            node1
            node2
'''

def fixup(node):
    if isinstance(node, str):
        return node
    elif isinstance(node, dict):
        for key in node:
            assert key.endswith(':'), key
        return dict((key[:-1], fixup(value)) for key, value in node.items())
    else:
        assert isinstance(node, (list, str))
        result = {}
        for item in node:
            if isinstance(item, str):
                assert not item.endswith(':')
                assert result.setdefault(item, None) is None
            else:
                for key, value in fixup(item).items():
                    assert result.setdefault(key, value) is value
        return result

print('')
print(loads(mystring))
print('')
print(simpleparse(mystring))
print('')
print(fixup(simpleparse(mystring)))
print('')

Will give:

[('node1:', ['node1', ('node2:', ['node1'])]), ('node2:', ['node1', 'node2', ('node3:', [('node1:', ['node1']), ('node2:', ['node1', 'node2'])])])]

OrderedDict([('node1:', ['node1', OrderedDict([('node2:', 'node1')])]), ('node2:', ['node1', 'node2', OrderedDict([('node3:', OrderedDict([('node1:', 'node1'), ('node2:', ['node1', 'node2'])]))])])])

{'node1': {'node1': None, 'node2': 'node1'}, 'node2': {'node1': None, 'node2': None, 'node3': {'node1': 'node1', 'node2': {'node1': None, 'node2': None}}}}

Upvotes: 4

Related Questions