LuZ
LuZ

Reputation: 13

How to identify and remove trace trees from nltk.trees?

For example I got this small tree (which is obviously a subtree only):

(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))

Trace trees are those trees leading to a leaf of the shape: *.*. I now want to remove all subtrees which are a trace tree. So for this example the result should look like this:

(VP (VBZ says))

So far I extracted all those leaves:

from nltk.tree import ParentedTree
import re

traceLeaves = []    

line = "( (VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*)))))"
currTree = ParentedTree.fromstring(line, remove_empty_top_bracketing = True)
for leaf in currTree.leaves():
    if re.search('\*', leaf):
        traceLeaves.append(leaf)

but I got no idea how to navigate up the tree until there exists a sibling which is no trace tree and remove the trace tree from the original tree. I'm completely stuck here since I only started working with nltk...

EDIT:

Here is one complete sentence I want to be able to process:

( (SINV (S-3 (S (NP-SBJ-1 (-NONE- *PRO*)) (VP (VBG Assuming) (NP (DT that) (NN post)) (PP-TMP (IN at) (NP (NP (DT the) (NN age)) (PP (IN of) (NP (CD 35))))))) (, ,) (NP-SBJ-1 (PRP he)) (VP (VBD managed) (PP-MNR (IN by) (NP (NN consensus))) (, ,) (SBAR-ADV (IN as) (S (NP-SBJ (-NONE- *PRO*)) (VP (VBZ is) (NP-PRD (DT the) (NN rule)) (PP-LOC (IN in) (NP (NNS universities)))))))) (, ,) (VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*)))) (NP-SBJ (NP (NNP Warren) (NNP H.) (NNP Strother)) (, ,) (NP (NP (DT a) (NN university) (NN official)) (SBAR (WHNP-2 (WP who)) (S (NP-SBJ-2 (-NONE- *T*)) (VP (VBZ is) (VP (VBG researching) (NP (NP (DT a) (NN book)) (PP (IN on) (NP (NNP Mr.) (NNP Hahn)))))))))) (. .)) )

Upvotes: 1

Views: 2332

Answers (2)

alexis
alexis

Reputation: 50190

Here's a second solution, using Tree.leaf_treeposition() which maps a leaf index to a tree path (leaf scan based on this answer, suggested by @b3000). Again the tree is scanned backwards to allow editing in-place. Take your pick.

Finding the parent uses the fact that a "tree position" is a tuple of the branches to follow; so that if postn is the "position" of node n, postn[:-1] is the position of n's parent. This way, you don't even need a ParentedTree.

t = nltk.Tree.fromstring("(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))")

for ind, leaf in reversed(list(enumerate(t.leaves()))):
    if leaf.startswith("*") and leaf.endswith("*"):
        postn = t.leaf_treeposition(ind)
        parentpos = postn[:-1]
        while parentpos and len(t[parentpos]) == 1:
            postn = parentpos
            parentpos = postn[:-1]
        print(t[postn], "will be deleted")
        del t[postn]

Upvotes: 0

alexis
alexis

Reputation: 50190

Leaves are regular strings, so they're no help for navigating the tree. Scan the tree and look for subtrees of height 2 instead.

To recognize what should be deleted, note that an nltk tree is a kind of list; so to see how many children a node has, just take its len(). When you find a trace leaf, move up the tree as long as the parent only has one child; then delete the subtree. This should cover everything since if a node dominates two trace trees and nothing else, it will contain only one after you delete the first :-)

The following has one more trick: Deleting a node confuses the for-loop, since the list of branches gets shorter. To prevent things from moving before we delete them, we scan the tree backwards.

for sub in reversed(list(t.subtrees())):
    if sub.height() == 2 and sub[0].startswith("*"):  # abbreviated test
        parent = sub.parent()
        while parent and len(parent) == 1:
            sub = parent
            parent = sub.parent()
        print(sub, "will be deleted")
        del t[sub.treeposition()]

Upvotes: 2

Related Questions