Peter

Reputation: 355

Python locate words in nltk.tree

I am trying to use NLTK to get the context of words. I have two sentences:

import pandas as pd

sentences = pd.DataFrame({"sentence": ["The weather was good so I went swimming", "Because of the good food we took desert"]})

I would like to find out what the word "good" refers to. My idea is to chunk the sentences (code from tutorial here) and then check whether the word "good" and a noun are in the same node. If they are not, "good" refers to a noun before or after that node.

First I build the chunker as in the tutorial:

import nltk
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs; the words themselves are dropped.
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        # sentence is a list of (word, POS tag) pairs.
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

NPChunker = ChunkParser(train_sents)

Then I apply this to the first sentence:

sentence = sentences["sentence"][0]
tags = nltk.pos_tag(sentence.lower().split())
result = NPChunker.parse(tags)
print(result)

The result looks like this:

(S
  (NP the/DT weather/NN)
  was/VBD
  (NP good/JJ)
  so/RB
  (NP i/JJ)
  went/VBD
  swimming/VBG)

Now I would like to find which node the word "good" is in. I have not figured out a better way than counting the words in the nodes and the leaves. The word "good" is word number 3 in the sentence (counting from 0).

structured_sentence = []
for n in range(len(result)):
    structured_sentence.append(list(result[n]))

structure_length = []
for n in result:
    if isinstance(n, nltk.tree.Tree):
        if n.label() == 'NP':
            print(n)
            structure_length.append(len(n))
    else:
        print(str(n) + " is a leaf")
        structure_length.append(1)

Summing up the word counts tells me where the word "good" is:

structure_frame = pd.DataFrame({"structure": structured_sentence, "length": structure_length})
structure_frame["cumsum"] = structure_frame["length"].cumsum()
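The counting idea can also be sketched without pandas or NLTK. This is only an illustration under assumptions: `nodes` is a hand-written stand-in for the chunk tree (a chunk is a list of pairs, a word outside any chunk is a bare pair), and `node_index` is a made-up helper name. The running totals play the role of the "cumsum" column, and a binary search over them turns a 0-based word position into a node index.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical stand-in for the chunk tree: a chunk is a list of (word, tag)
# pairs, and a word outside any chunk is a bare (word, tag) tuple.
nodes = [
    [("the", "DT"), ("weather", "NN")],
    ("was", "VBD"),
    [("good", "JJ")],
    ("so", "RB"),
    [("i", "JJ")],
    ("went", "VBD"),
    ("swimming", "VBG"),
]

# Number of words in each node, then the running total after each node.
lengths = [len(n) if isinstance(n, list) else 1 for n in nodes]
ends = list(accumulate(lengths))


def node_index(word_position):
    """Return the index of the node holding the word at a 0-based position."""
    return bisect_right(ends, word_position)


# Flatten the nodes back into the plain word sequence.
words = [w for n in nodes for (w, _tag) in (n if isinstance(n, list) else [n])]
idx = node_index(words.index("good"))
print(idx, nodes[idx])
```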

Is there an easier way to determine which node or leaf a word belongs to, and to find out which word "good" refers to?

Best Alex

Upvotes: 3

Views: 2317

Answers (1)

alexis

Reputation: 50190

It's easiest to find your word in a list of leaves. You can then translate the leaf index into a tree index, which is a path down the tree. To see what is grouped with "good", go up one level and examine the subtree that this picks out.

First, find out the position of "good" in your flat sentence. (You could skip this if you still had the untagged sentence as a list of tokens.)

words = [ w for w, t in result.leaves() ]

Now we find the linear position of "good", and translate it into a tree path:

>>> position = words.index("good")
>>> treeposition = result.leaf_treeposition(position)
>>> print(treeposition)
(2, 0)

A "treeposition" is a path down the tree, expressed as a tuple. (NLTK trees can be indexed with tuples as well as integers.) To see the sisters of "good", stop one step before you get to the end of the path.
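To try tuple indexing without training the chunker, here is a small sketch. It assumes a hand-built copy of the tree printed in the question, not real chunker output; only `Tree` and `leaf_treeposition` come from NLTK.

```python
from nltk import Tree

# Hand-built copy of the chunk tree from the question, so no corpus or
# trained chunker is needed to try this out.
result = Tree("S", [
    Tree("NP", [("the", "DT"), ("weather", "NN")]),
    ("was", "VBD"),
    Tree("NP", [("good", "JJ")]),
    ("so", "RB"),
    Tree("NP", [("i", "JJ")]),
    ("went", "VBD"),
    ("swimming", "VBG"),
])

path = result.leaf_treeposition(3)  # path to the fourth leaf, "good"
leaf = result[path]                 # full path: the leaf itself
parent = result[path[:-1]]          # one step short: the enclosing node

# A leaf that is not inside any chunk has a one-element path,
# so going up one level lands on the root.
was_path = result.leaf_treeposition(2)
was_parent = result[was_path[:-1]]

print(path, leaf, parent.label())
print(was_path, was_parent.label())
```

Note the second case: for a word such as "was" that sits directly under the root, the "parent" is the whole sentence, so it is worth checking the length of the path before treating the parent as a chunk.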

>>> result[treeposition[:-1]]
Tree('NP', [('good', 'JJ')])

There you are: a subtree with one leaf, the pair ("good", "JJ").
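Putting the steps together, the idea from the question (look for a noun in the same chunk, otherwise in a chunk before or after it) could be sketched roughly as below. Treat it as one possible reading, not a tested linguistic rule: `nouns_for` is a made-up helper name, it assumes a flat chunk tree whose leaves are (token, tag) pairs, it counts any tag starting with "NN" as a noun, and both trees are written by hand rather than produced by the chunker.

```python
from nltk import Tree


def nouns_for(tree, word):
    """Nouns chunked with `word`, else nouns in the closest sibling node.

    Assumes a flat chunk tree: every leaf is a (token, tag) pair that sits
    either directly under the root or inside one chunk under the root.
    """
    words = [w for w, _tag in tree.leaves()]
    path = tree.leaf_treeposition(words.index(word))  # ValueError if absent
    top = path[0]  # index of the root's child that holds the word

    def nouns(node):
        leaves = node.leaves() if isinstance(node, Tree) else [node]
        return [w for w, tag in leaves if tag.startswith("NN")]

    found = nouns(tree[top])
    distance = 1
    # Widen the search one sibling at a time until a noun turns up.
    while not found and (top - distance >= 0 or top + distance < len(tree)):
        for j in (top - distance, top + distance):  # left first, then right
            if 0 <= j < len(tree) and not found:
                found = nouns(tree[j])
        distance += 1
    return found


# Hand-built copy of the tree from the question.
first = Tree("S", [
    Tree("NP", [("the", "DT"), ("weather", "NN")]),
    ("was", "VBD"),
    Tree("NP", [("good", "JJ")]),
    ("so", "RB"),
    Tree("NP", [("i", "JJ")]),
    ("went", "VBD"),
    ("swimming", "VBG"),
])

# Hypothetical tree where "good" shares a chunk with a noun.
second = Tree("S", [Tree("NP", [("the", "DT"), ("good", "JJ"), ("food", "NN")])])

print(nouns_for(first, "good"))
print(nouns_for(second, "good"))
```

Searching left before right at equal distance is an arbitrary choice here; a real rule for what an adjective modifies would need more than chunk adjacency.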

Upvotes: 8
