Reputation: 355
I am trying to use NLTK to get the context of words. I have two sentences:
import pandas as pd
sentences = pd.DataFrame({"sentence": ["The weather was good so I went swimming", "Because of the good food we took desert"]})
I would like to find out what the word "good" refers to. My idea is to chunk the sentences (code from tutorial here) and then check whether the word "good" and a noun are in the same node. If not, it refers to a noun before or after that node.
First I build the chunker as in the tutorial:
import nltk
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs, so chunk tags are predicted from POS tags
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        # Recombine words, POS tags and predicted chunk tags into a tree
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

NPChunker = ChunkParser(train_sents)
Then I apply this to my sentences:
sentence = sentences["sentence"][0]
tags = nltk.pos_tag(sentence.lower().split())
result = NPChunker.parse(tags)
print(result)
The result looks like this:
(S
  (NP the/DT weather/NN)
  was/VBD
  (NP good/JJ)
  so/RB
  (NP i/JJ)
  went/VBD
  swimming/VBG)
Now I would like to find the node that contains the word "good". I have not found a better way than counting the words in the nodes and in the leaves. Counting from zero, "good" is word number 3 in the sentence.
stuctured_sentence = []
for n in range(len(result)):
    stuctured_sentence.append(list(result[n]))

structure_length = []
for n in result:
    if isinstance(n, nltk.tree.Tree):
        # A chunk: count the words it contains
        if n.label() == 'NP':
            print(n)
            structure_length.append(len(n))
    else:
        # A word that sits directly under the root counts as one
        print(str(n) + " is a leaf")
        structure_length.append(1)
From summing up the number of words, I know where the word "good" is.
structure_frame=pd.DataFrame({"structure": stuctured_sentence, "length": structure_length})
structure_frame["cumsum"]=structure_frame["length"].cumsum()
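The lookup step itself could be sketched roughly as follows. This is only an illustration: the chunk lengths are typed in by hand from the printed tree above, the word position is counted from zero, and `node_index_for_word` is a made-up helper name, not part of NLTK or pandas.

```python
from bisect import bisect_right
from itertools import accumulate

# Number of words in each top-level element of the printed tree:
# (NP the weather), was, (NP good), so, (NP i), went, swimming
structure_length = [2, 1, 1, 1, 1, 1, 1]

# Running total of words, the same idea as the "cumsum" column
cumsum = list(accumulate(structure_length))

def node_index_for_word(word_position, cumulative_lengths):
    """Map a 0-based word position to the index of the top-level element holding it."""
    # The first element whose running total exceeds the position contains the word
    return bisect_right(cumulative_lengths, word_position)

# "good" is at 0-based position 3 in the sentence
good_node = node_index_for_word(3, cumsum)
print(good_node)  # prints 2, i.e. the third top-level element, (NP good/JJ)
```

This only tells me which top-level element holds the word, not what "good" refers to.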
Is there an easier way to determine which node or leaf a word is in, and to find out which word "good" refers to?
Best, Alex
Upvotes: 3
Views: 2317
Reputation: 50190
It's easiest to find your word in a list of leaves. You can then translate the leaf index into a tree index, which is a path down the tree. To see what is grouped with good, go up one level and examine the subtree that this picks out.
First, find out the position of good in your flat sentence. (You could skip this if you still had the untagged sentence as a list of tokens.)
words = [ w for w, t in result.leaves() ]
Now we find the linear position of good, and translate it into a tree path:
>>> position = words.index("good")
>>> treeposition = result.leaf_treeposition(position)
>>> print(treeposition)
(2, 0)
A "treeposition" is a path down the tree, expressed as a tuple. (NLTK trees can be indexed with tuples as well as integers.) To see the sisters of good, stop one step before you get to the end of the path.
>>> print(result[ treeposition[:-1] ])
Tree('NP', [('good', 'JJ')])
There you are. A subtree with one leaf, the pair (good, JJ).
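To get from there to "which noun does good go with", one option is to collect the nouns in that same subtree. Here is a minimal sketch under some loud assumptions: the trees are built by hand rather than produced by the trained chunker, the chunking shown for the second sentence is a guess at what a chunker might return, and `nouns_in_same_chunk` is a hypothetical helper name, not an NLTK function.

```python
from nltk import Tree

def nouns_in_same_chunk(tree, word):
    """Return the nouns that share a chunk with `word` (empty list if there are none)."""
    words = [w for w, _tag in tree.leaves()]
    path = tree.leaf_treeposition(words.index(word))
    if len(path) == 1:
        # The word hangs directly under the root, so it is not inside any chunk
        return []
    parent = tree[path[:-1]]  # the subtree one level above the leaf
    return [w for w, tag in parent.leaves() if tag.startswith("NN")]

# Hand-built copy of the tree printed in the question
first = Tree("S", [
    Tree("NP", [("the", "DT"), ("weather", "NN")]),
    ("was", "VBD"),
    Tree("NP", [("good", "JJ")]),
    ("so", "RB"),
    Tree("NP", [("i", "JJ")]),
    ("went", "VBD"),
    ("swimming", "VBG"),
])

# ASSUMED chunking of the second sentence, for illustration only
second = Tree("S", [
    ("because", "IN"),
    ("of", "IN"),
    Tree("NP", [("the", "DT"), ("good", "JJ"), ("food", "NN")]),
    Tree("NP", [("we", "PRP")]),
    ("took", "VBD"),
    Tree("NP", [("desert", "NN")]),
])

first_nouns = nouns_in_same_chunk(first, "good")
second_nouns = nouns_in_same_chunk(second, "good")
print(first_nouns, second_nouns)
```

An empty result, as for the first sentence, means the chunk holds no noun, so you would have to look at neighbouring nodes instead. Note that `words.index` only finds the first occurrence of the word.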
Upvotes: 8