Reputation: 13
For example I got this small tree (which is obviously a subtree only):
(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))
Trace trees are those trees leading to a leaf of the shape: *.*. I now want to remove all subtrees which are a trace tree. So for this example the result should look like this:
(VP (VBZ says))
So far I extracted all those leaves:
from nltk.tree import ParentedTree
import re
traceLeaves = []
line = "( (VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*)))))"
currTree = ParentedTree.fromstring(line, remove_empty_top_bracketing = True)
for leaf in currTree.leaves():
if re.search('\*', leaf):
traceLeaves.append(leaf)
but I got no idea how to navigate up the tree until there exists a sibling which is no trace tree and remove the trace tree from the original tree. I'm completely stuck here since I only started working with nltk...
EDIT:
Here is one complete sentence I want to be able to process:
( (SINV (S-3 (S (NP-SBJ-1 (-NONE- *PRO*)) (VP (VBG Assuming) (NP (DT that) (NN post)) (PP-TMP (IN at) (NP (NP (DT the) (NN age)) (PP (IN of) (NP (CD 35))))))) (, ,) (NP-SBJ-1 (PRP he)) (VP (VBD managed) (PP-MNR (IN by) (NP (NN consensus))) (, ,) (SBAR-ADV (IN as) (S (NP-SBJ (-NONE- *PRO*)) (VP (VBZ is) (NP-PRD (DT the) (NN rule)) (PP-LOC (IN in) (NP (NNS universities)))))))) (, ,) (VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*)))) (NP-SBJ (NP (NNP Warren) (NNP H.) (NNP Strother)) (, ,) (NP (NP (DT a) (NN university) (NN official)) (SBAR (WHNP-2 (WP who)) (S (NP-SBJ-2 (-NONE- *T*)) (VP (VBZ is) (VP (VBG researching) (NP (NP (DT a) (NN book)) (PP (IN on) (NP (NNP Mr.) (NNP Hahn)))))))))) (. .)) )
Upvotes: 1
Views: 2332
Reputation: 50190
Here's a second solution, using Tree.leaf_treeposition()
which maps a leaf index to a tree path (leaf scan based on this answer, suggested by @b3000). Again the tree is scanned backwards to allow editing in-place. Take your pick.
Finding the parent uses the fact that a "tree position" is a tuple of the branches to follow; so that if postn
is the "position" of node n
, postn[:-1]
is the position of n
's parent. This way, you don't even need a ParentedTree
.
t = nltk.Tree.fromstring("(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))")
for ind, leaf in reversed(list(enumerate(t.leaves()))):
if leaf.startswith("*") and leaf.endswith("*"):
postn = t.leaf_treeposition(ind)
parentpos = postn[:-1]
while parentpos and len(t[parentpos]) == 1:
postn = parentpos
parentpos = postn[:-1]
print(t[postn], "will be deleted")
del t[postn]
Upvotes: 0
Reputation: 50190
Leaves are regular strings, so they're no help for navigating the tree. Scan the tree and look for subtrees of height 2 instead.
To recognize what should be deleted, note that an nltk tree is a kind of list; so to see how many children a node has, just take its len()
. When you find a trace leaf, move up the tree as long as the parent only has one child; then delete the subtree. This should cover everything since if a node dominates two trace trees and nothing else, it will contain only one after you delete the first :-)
The following has one more trick: Deleting a node confuses the for-loop, since the list of branches gets shorter. To prevent things from moving before we delete them, we scan the tree backwards.
for sub in reversed(list(t.subtrees())):
if sub.height() == 2 and sub[0].startswith("*"): # abbreviated test
parent = sub.parent()
while parent and len(parent) == 1:
sub = parent
parent = sub.parent()
print(sub, "will be deleted")
del t[sub.treeposition()]
Upvotes: 2