Reputation: 7167
Based on the grammar in chapter 7 of the NLTK Book:
grammar = r"""
NP: {<DT|JJ|NN.*>+} # ...
"""
I want to expand NP (noun phrase) to include multiple NPs joined by CC (coordinating conjunctions, e.g. "and") or , (commas), to capture noun phrases like:
I cannot get my modified grammar to capture those as a single NP:
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+(<CC|,>+<NP>)?}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
Results in:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
I've also tried moving the NP to the beginning: NP: {(<NP><CC|,>+)?<DT|JJ|NN.*>+}
but I get the same result:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
Upvotes: 4
Views: 836
Reputation: 122142
Let's start small and capture NP (noun phrases) properly:
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
Now let's try to catch that and/CC. Simply add a higher-level phrase that reuses the <NP> rule:
grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC><NP>}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S (CNP (NP The/DT house/NN) and/CC (NP tree/NN)))
Now that we catch NP CC NP phrases, let's get a little fancy and see whether it catches commas:
grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC|,><NP>}
"""
sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
Now we see that it only catches the first (leftmost) NP CC|, NP sequence and leaves the last NP alone.
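One way to see why the last NP gets left out: a chunk rule behaves much like a regular expression run over the sequence of tags, and its matches don't overlap. Here's a rough sketch that imitates this with Python's re module on a hand-written tag string. It only illustrates the matching behavior, it is not NLTK's actual implementation, and tag_string and pair are names made up for this example.

```python
import re

# Tag sequence left after the NP stage for 'The house, the bear and tree'
# (written by hand for illustration, not produced by NLTK).
tag_string = "<NP><,><NP><CC><NP>"

# Rough regex equivalent of the tag pattern <NP><CC|,><NP>.
pair = re.compile(r"<NP><(?:CC|,)><NP>")

# Matches are found left to right and never overlap.
matches = [(m.start(), m.group()) for m in pair.finditer(tag_string)]
print(matches)

# Whatever the match did not consume stays outside the CNP chunk.
leftover = pair.sub("", tag_string)
print(leftover)
```

The middle NP is consumed by the first match, so the trailing conjunction and NP have no NP on their left to pair with.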
Since we know that conjunctive phrases in English have the conjunction on the left and the NP on the right, i.e. CC|, NP (e.g. "and the tree"), we can see that the CC|, NP pattern repeats, so we can use it as an intermediate representation:
grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""
sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP and/CC (NP tree/NN))))
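To make the two-stage idea concrete, here's a small sketch that mimics the XNP and CNP stages with plain re substitutions over a hand-written tag string. It's only an analogy for how the cascaded rules build on each other, not how NLTK is implemented, and the variable names are made up for the example.

```python
import re

# Tag sequence left after the NP stage for 'The house, the bear and tree'
# (written by hand for illustration).
tag_string = "<NP><,><NP><CC><NP>"

# Stage 1 (XNP): fold every conjunction-or-comma plus the NP after it
# into a single unit.
after_xnp = re.sub(r"<(?:CC|,)><NP>", "<XNP>", tag_string)
print(after_xnp)

# Stage 2 (CNP): one NP followed by one or more XNP units.
cnp = re.search(r"<NP>(?:<XNP>)+", after_xnp)
print(cnp.group())
```

Because each XNP carries its own conjunction, the + in the second stage can swallow a chain of any length.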
Ultimately, the CNP (conjunctive NP) grammar captures chained noun phrase conjunctions in English, even complicated ones, e.g.
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""
sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP ,/, (NP the/DT green/JJ house/NN))
    (XNP and/CC (NP a/DT tree/JJ)))
  went/VBD
  to/TO
  (CNP (NP the/DT park/NN) (XNP or/CC (NP the/DT river/NN)))
  ./.)
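The chunker only looks at the tags, and those come from pos_tag, so the result can shift with the tagger (note tree/JJ in the output above). If you want to check the grammar on its own, you can hand it tokens you tagged yourself. A minimal sketch, assuming the hand-written tags below are what you'd expect for the phrase:

```python
import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

# Hand-tagged tokens (an assumption for this example, not tagger output).
tagged = [
    ("the", "DT"), ("park", "NN"),
    ("or", "CC"),
    ("the", "DT"), ("river", "NN"),
]

# RegexpParser.parse accepts a plain list of (token, tag) pairs.
tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)
```

This skips word_tokenize and pos_tag entirely, so it needs no tokenizer or tagger data.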
And if you're just interested in extracting the noun phrases, from How to Traverse an NLTK Tree object?:
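noun_phrases = []

def traverse_tree(tree):
    # Record the text of a CNP chunk, then recurse into child subtrees.
    if tree.label() == 'CNP':
        noun_phrases.append(' '.join([token for token, tag in tree.leaves()]))
    for subtree in tree:
        # Leaves are (token, tag) tuples; only Tree nodes can hold chunks.
        if isinstance(subtree, nltk.tree.Tree):
            traverse_tree(subtree)
    return noun_phrases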
sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
traverse_tree(chunkParser.parse(tagged))
[out]:
['The house , the bear , the green house and a tree', 'the park or the river']
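Note that traverse_tree appends to a module-level list, so calling it on a second sentence keeps the phrases from the first. If that bothers you, a possible alternative is Tree.subtrees with a filter, which returns a fresh list each time. A sketch using hand-tagged tokens (an assumption, to keep the example independent of the tagger); conjoined_noun_phrases is a name made up here:

```python
import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

# Hand-tagged tokens (an assumption for this example, not tagger output).
tagged = [
    ("The", "DT"), ("house", "NN"), (",", ","),
    ("the", "DT"), ("bear", "NN"), ("and", "CC"),
    ("tree", "NN"), ("fell", "VBD"),
]
tree = nltk.RegexpParser(grammar).parse(tagged)

def conjoined_noun_phrases(tree):
    """Return the text of every CNP subtree as a new list."""
    return [
        " ".join(token for token, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "CNP")
    ]

print(conjoined_noun_phrases(tree))
```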
Also, see Python (NLTK) - more efficient way to extract noun phrases?
Upvotes: 6