Get the depth of words from an nltk tree

I'm working on an NLP project and I want to filter out words depending on their position in the dependency tree.

To plot the tree I'm using the code from this post:

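from nltk import Tree

def to_nltk_tree(node):
    # A node with children becomes a subtree; a childless node becomes a leaf string.
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_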

For a sample sentence:

"A group of people around the world are suddenly linked mentally"

I got this tree:

[Image: dependency tree of the sample sentence]

From this tree what I want to get is a list of tuples with the word and its corresponding depth in the tree:

[(linked,1),(are,2),(suddenly,2),(mentally,2),(group,2),(A,3),(of,3),(people,4)....]

In this case, I'm not interested in words that have no children: [are, suddenly, mentally, A, the]. So far I have only been able to get the list of words that do have children, using this code:

def get_words(root,words):
    children = list(root.children)
    for child in children:
        if list(child.children):
            words.append(child)
            get_words(child,words)
    return list(set(words))

[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]
s_root = list(doc.sents)[0].root
words = []
words.append(s_root)    
words = get_words(s_root,words)
words

[around, linked, world, of, people, group]

From this, how can I get the desired tuples with each word and its respective depth?

Upvotes: 1

Views: 2259

Answers (1)

alexis

Reputation: 50200

Are you sure that's an nltk Tree in your code? nltk's Tree class does not have a children attribute. With an nltk Tree, you can do what you want by using "treepositions", which are paths down the tree. Each path is a tuple of branch choices. In the tree below, the treeposition of "people" is (0, 2, 1, 0), and as you can see, the depth of a node is just the length of its treeposition.

First I get the paths of the leaves so I can exclude them:

import nltk

t = nltk.Tree.fromstring("""(linked (are suddenly mentally 
                                     (group A (of (people (around (world the)))))))""")
n_leaves = len(t.leaves())
# Collect the treeposition of every leaf so leaves can be skipped later.
leavepos = set(t.leaf_treeposition(n) for n in range(n_leaves))

Now it's easy to list the non-terminal nodes and their depth:

>>> for pos in t.treepositions():
...     if pos not in leavepos:
...         print(t[pos].label(), len(pos))
...
linked 0
are 1
group 2
of 3
people 4
around 5
world 6
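
To get the list of tuples the question asks for rather than printed lines, the same idea can be collected in a list comprehension. This is only a sketch: the name word_depths is mine, and instead of precomputing the leaf positions it tests whether each position holds a subtree, which should be equivalent here because leaves are plain strings.

```python
import nltk

# Same bracketed tree as in the answer above.
t = nltk.Tree.fromstring(
    "(linked (are suddenly mentally (group A (of (people (around (world the)))))))"
)

# Keep only positions that hold a subtree (leaves are plain strings),
# and use the length of the position tuple as the depth.
word_depths = [
    (t[pos].label(), len(pos))
    for pos in t.treepositions()
    if isinstance(t[pos], nltk.Tree)
]
print(word_depths)
```

The root comes out at depth 0; if you want it to start at 1 as in the question, use len(pos) + 1.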

Incidentally, nltk trees have their own display methods. Try print(t) or t.draw(), which draws the tree in a pop-up window.
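
If the objects in the question are not nltk Trees but nodes that expose .children and .orth_ (as the question's code suggests), a plain recursion that carries the depth along may be enough. The following is a rough sketch under that assumption: FakeToken is a hypothetical stand-in I made up so the example runs on its own, and words_with_depth is an illustrative name, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing only the two attributes used below."""
    orth_: str
    children: list = field(default_factory=list)


def words_with_depth(node, depth=1):
    """Return (word, depth) pairs for every node that has children."""
    kids = list(node.children)
    if not kids:
        # Childless words are skipped, as the question requests.
        return []
    pairs = [(node.orth_, depth)]
    for child in kids:
        pairs.extend(words_with_depth(child, depth + 1))
    return pairs


# A small hand-built fragment, not the full sample sentence.
root = FakeToken("linked", [
    FakeToken("group", [FakeToken("A"), FakeToken("of", [FakeToken("people")])]),
    FakeToken("are"),
])
print(words_with_depth(root))
```

With real parser output you would pass the sentence root (s_root in the question) instead of the hand-built root.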

Upvotes: 1
