Reputation: 557
What's the best way to parse a plain text tree structure like this:
node1:
node1
node2:
node1
node2:
node1
node2
node3:
node1:
node1
node2:
node1
node2
and convert it into a tree structure (with lists or dictionaries)?
Is there any python library to help me with the parsing?
Upvotes: 2
Views: 2688
Reputation: 30288
You could construct a simple parser that generates a valid python expression from the input and then evaluate it. My initial thought had been a simple recursive parser, but that's more difficult than I anticipated because their is no way to know the block is ending without a peak ahead - a common problem with indent based formats.
This generates nested list of tuples (block_name, [contents]):
i = 0
r = '['
for l in mystring.split('\n'):
if not l:
continue
cl = l.lstrip(' ')
ci = (len(l) - len(cl))//4
if ci > i: # line indented
r += '['
elif ci < i: # line unindented, can be multiple
r += '])'*(i-ci) + ','
if cl[-1] == ':': # new block
r += '{("{}":'.format(cl[:-1])
else: # new item
r += '"{}",'.format(cl)
i = ci
r += ']'+')]'*i
eval(r)
Output:
[('node1', ['node1', ('node2', ['node1'])]),
('node2',
['node1',
'node2',
('node3', [('node1', ['node1']), ('node2', ['node1', 'node2'])])])]
Upvotes: 1
Reputation: 8137
The rson library will do this, except you'll probably have to subclass the parser to allow the mixing of array and dict style elements in a single structure.
Edit Actually, that might be a bit difficult, but the rsonlite package will (sort of) work as-is with your data.
rsonlite is a small, single-module package that is only 300 lines long, and the same source works with both Python 2 and Python 3.
Here is an example that shows 3 different outputs from your data. The first output is what rsonlite.dumps()
gives, the second output is what the slightly higher-level rsonlite.simpleparse()
gives, and the third output takes the results from simpleparse
and runs them through a custom fixup()
function to create a pure nested dictionary data structure, where any missing value is simply set to None
, and all the colon characters are checked and stripped.
from rsonlite import loads, simpleparse
mystring = '''
node1:
node1
node2:
node1
node2:
node1
node2
node3:
node1:
node1
node2:
node1
node2
'''
def fixup(node):
if isinstance(node, str):
return node
elif isinstance(node, dict):
for key in node:
assert key.endswith(':'), key
return dict((key[:-1], fixup(value)) for key, value in node.items())
else:
assert isinstance(node, (list, str))
result = {}
for item in node:
if isinstance(item, str):
assert not item.endswith(':')
assert result.setdefault(item, None) is None
else:
for key, value in fixup(item).items():
assert result.setdefault(key, value) is value
return result
print('')
print(loads(mystring))
print('')
print(simpleparse(mystring))
print('')
print(fixup(simpleparse(mystring)))
print('')
Will give:
[('node1:', ['node1', ('node2:', ['node1'])]), ('node2:', ['node1', 'node2', ('node3:', [('node1:', ['node1']), ('node2:', ['node1', 'node2'])])])]
OrderedDict([('node1:', ['node1', OrderedDict([('node2:', 'node1')])]), ('node2:', ['node1', 'node2', OrderedDict([('node3:', OrderedDict([('node1:', 'node1'), ('node2:', ['node1', 'node2'])]))])])])
{'node1': {'node1': None, 'node2': 'node1'}, 'node2': {'node1': None, 'node2': None, 'node3': {'node1': 'node1', 'node2': {'node1': None, 'node2': None}}}}
Upvotes: 4