Reputation: 411
Assume I have a binary tree in bracketed representation as follows:
best_parse = "(ROOT (S (S (S (S (S Our) (S intent)) (S is)) (S (S to) (S (S promote) (S (S (S the) (S best)) (S alternative))))) (S (S he) (S says))))"
I can get the spans (constituents) out of it using the tree_to_spans
method, which gives me [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]
import nltk
import collections
def tree_to_spans(tree):
if isinstance(tree, str):
tree = nltk.Tree.fromstring(tree)
length = len(tree.pos())
queue = collections.deque(tree.treepositions())
stack = [(queue.popleft(), 0)]
j = 0
spans = []
while stack != []:
(p, i) = stack[-1]
if not queue or queue[0][:-1] != p:
if isinstance(tree[p], nltk.tree.Tree):
if j - i > 1:
spans.append((tree[p].label(), (i, j)))
else:
j = i + 1
stack.pop()
else:
q = queue.popleft()
stack.append((q, j))
return spans
I can also get the string corresponding to the span using get_constituents
method, which gives me ALL CONSTITUENTS of the tree.
['Our intent', 'Our intent is', 'the best', 'the best alternative', 'promote the best alternative', 'to promote the best alternative', 'Our intent is to promote the best alternative', 'he says', 'Our intent is to promote the best alternative he says']
def get_constituents(sample_string):
t = nltk.Tree.fromstring(sample_string)
spans = evaluate.tree_to_spans(t)
sentence = " ".join(item[0] for item in t.pos()).split()
constituents = [" ".join(sentence[span[0]: span[1]])for span in spans]
# Add original sentence
constituents = constituents + [" ".join(sentence)]
return constituents
PROBLEM: I am stuck at converting it back to the tree (like the one shown at the beginning) given a list of ALL CONSTITUENTS. Basically, reverse engineering. Assume the sentence is given:
s = "Our intent is to promote the best alternative he says"
Edit
Assume certain elements can be deleted from the list (parts). For eg -
parts = [
'Our intent',
'the best',
'the best alternative',
'promote the best alternative',
'to promote the best alternative',
'Our intent is to promote the best alternative',
'Our intent is to promote the best alternative he says'
]
Here, "He says"
and "Our intent is"
have been removed from the list. I want to get the tree structure in the same format as above, except the ones for the deleted constituents.
Another way to look at it is:
consider, we have the spans -
spans = [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]
I delete (0, 3)
and (8, 10)
.
I want to put brackets over, like this:
(((0 1 2) (3 (4 ((5 6 7) 8)))) 9 10)
Then, we can map the index with the corresponding string.
EDIT-2
For instance, if we were to remove ONLY "he says"
and "Our intent is"
from the parts, our final tree in bracketed form should look like:
"(ROOT (S (S (S (S (S Our) (S intent)) (S is) (S (S to) (S (S promote) (S (S (S the) (S best)) (S alternative)))))(S he) (S says))))")
Another instance, if we were to remove ONLY "to promote the best alternative",
and "Our intent is to promote the best alternative"
from the parts, our final tree in bracketed form should look like:
"(ROOT (S (S (S Our) (S intent)) (S is)) (S to) (S (S promote) (S (S (S the) (S best)) (S alternative))) (S (S he) (S says)))"
We can assume that the full-sentence "Our intent is to promote the best alternative he says"
will NEVER be deleted. This is also TRUE for single-words in the sentence, just to give you a background.
Upvotes: 2
Views: 227
Reputation: 350760
I would iterate the constituents in reversed order, so to get an in-order traversal of the tree (with the right side traversed before the left side). So in this solution I assume the order of constituents is given in the same order as you get them from your code.
With recursion you can then rebuild each subtree recursively:
def to_tree(parts):
i = len(parts) - 1
def recur(part, expect=False):
nonlocal i
words = part.split()
if len(words) == 1: # leaf
return "(S {})".format(words[0])
if expect and i > 0 and parts[i-1] == part:
i -= 1
if len(words) == 2: # 2 leaves
return "(S (S {}) (S {}))".format(*words)
i -= 1
nextpart = parts[i]
if part.endswith(" " + nextpart):
right = recur(nextpart)
left = recur(part[0:-len(nextpart)-1], True)
elif part.startswith(nextpart + " "):
right = recur(part[len(nextpart)+1:], True)
left = recur(nextpart)
else:
sides = part.split(" " + nextpart + " ")
assert len(sides) == 2, "Could not parse input"
right = recur(sides[1], True)
left = recur(sides[0], True)
return "(S {} {})".format(left, right)
return "(ROOT {})".format(recur(parts[i]))
The example could be run as:
parts = [
'Our intent',
'Our intent is',
'the best',
'the best alternative',
'promote the best alternative',
'to promote the best alternative',
'Our intent is to promote the best alternative',
'he says',
'Our intent is to promote the best alternative he says'
]
print(to_tree(parts))
...which would output your original string.
As per your edit, the above code can "survive" some removals from the input. For example 'Our intent is' and 'he says' can be removed from the input, and the output will still be the same. But there are limitations. If too much is removed, the tree cannot be rebuilt any more.
Upvotes: 2