Arno Werkman
Arno Werkman

Reputation: 1

How to parse a BinaryTree from a string?

I want to code a BinaryTree parser. I don't know how to solve this problem. I've tried using regular expressions recursively but I can't find good resources. My goal is:

Here is some source code:

class BinaryTree:
    def __init__(self, left=None, name=None, right=None):
        self.left = left
        self.name = name
        self.right = right

    def __str__(self):
        return f"({self.left}) {self.name} ({self.right})"

    def __repr__(self):
        return f"BinaryTree({repr(self.left)}, {repr(self.name)}, {repr(self.right)})"

    def __len__(self):
        if self.name is not None:
            output = 1
        else:
            output = 0
        if self.left is not None:
            output += len(self.left)
        if self.right is not None:
            output += len(self.right)
        return output

    @staticmethod
    def from_string(string):
        # "(x) y (z)" --> BinaryTree("x", "y", "z")
        # "((a) b (c)) y (z)" --> BinaryTree(BinaryTree("a", "b", "c"), "y", "z")
        # "" --> None
        # ()  () --> BinaryTree("", "", "")
        pass

Upvotes: 0

Views: 669

Answers (2)

trincot
trincot

Reputation: 350272

You can use regular expressions, but only use it to tokenize the input, i.e. it should match all parentheses as individual matches, and match quoted or non-quoted literals. Some care has to be taken to support backslash escaping inside quoted substrings.

For converting quoted strings and numbers to the corresponding Python data typed values, you can use ast.literal_eval. Of course, if the input format would have been a valid Python expression (using comma separators, ...etc), you could leave the parsing completely to ast.literal_eval. But as this is not the case, you'll have to tokenize the input and iterate over the tokens.

So import these:

import re
import ast

And then:

    @staticmethod
    def from_string(string):
        tokens = re.finditer(r"[()]|'(?:\\.|[^'])*'|[^\s()]+|$", string)

        def format(token):
            return f"'{token}'" if token else "end of input"

        def take(end, expect=None, forbid=None):
            token = next(tokens).group(0)
            if expect is not None and token != expect:
                raise ValueError("Expected {}, got {}".format(format(expect), format(token)))
            if end != "" and token == "" or token == forbid:
                raise ValueError("Unexpected {}".format(format(token)))
            if token not in ("(", ")", ""):
                token = ast.literal_eval(token)
            return token

        def recur(end=")"):
            token = take(end)
            if token == end:  # it is an empty leaf
                return None  
            if token != "(":  # it is a leaf
                take(end, end)
                return token
            # It is a (left)-name-(right) sequence:
            left = recur()
            name = None
            token = take(end, None, end)
            if token != "(":
                name = token
                take(end, "(")
            right = recur()
            take(end, end)
            return BinaryTree(left, name, right)

        return recur("")

Upvotes: 0

Prune
Prune

Reputation: 77847

First, I believe that you need to drop the idea of regular expressions and concentrate on simple matching the parentheses. You have a very simple expression grammar here. Rather than reproducing such a well-traveled exercise, I simply direct your to research how to parse a binary tree expression with parentheses.

The basic expression is

left root right

where each of left and right is either

  • a sub-tree (first char is a left parenthesis)
  • a leaf-node label (first char is something else)
  • null (white space)

Note that you have some ambiguities. For instance, given a b, is the resulting tree (a, b, None), (None, a, b), or an error?

In any case, if you focus on simple string processing, you should be able to do this without external packages:

  • Find the first left parenthesis and its matching right.
  • In the string after that, look again for a left-right pair.
  • If there's anything before that first left-paren, then it must be a leaf node for the left and the node label for the root.
  • Either way, there must be a root node in the middle (unless this is a degenerate tree).
  • Recur on each of the paren matches you made.

Can you take it from there?

Upvotes: 1

Related Questions