Reputation: 285
I have a corpus of English sentences:
sentences = [
    "Mary had a little lamb.",
    "John has a cute black pup.",
    "I ate five apples."
]
and a grammar (for the sake of simplicity):
grammar = ('''
    NP: {<NNP><VBZ|VBD><DT><JJ>*<NN><.>} # NP
    ''')
I want to filter out the sentences that don't conform to the grammar. Is there a built-in NLTK function that can do this? In the example above, the first two sentences follow the pattern of my grammar, but the last one does not.
Upvotes: 2
Views: 810
Reputation: 121992
Write a grammar, check that it parses, then iterate through the subtrees of each parsed sentence and look for the non-terminal you want, e.g. NP.
Code:
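import nltk
from nltk import pos_tag, word_tokenize

# word_tokenize and pos_tag need NLTK's tokenizer and tagger models,
# which can be installed with nltk.download().

grammar = ('''
    NP: {<NNP><VBZ|VBD><DT><JJ>*<NN><.>} # NP
    ''')

sentences = [
    "Mary had a little lamb.",
    "John has a cute black pup.",
    "I ate five apples."
]

chunkParser = nltk.RegexpParser(grammar)

def has_noun_phrase(sentence):
    # Tokenize, POS-tag, then chunk the sentence with the grammar.
    parsed = chunkParser.parse(pos_tag(word_tokenize(sentence)))
    # A matched chunk shows up as a subtree labelled 'NP'.
    for subtree in parsed:
        if isinstance(subtree, nltk.Tree) and subtree.label() == 'NP':
            return True
    return False

for sentence in sentences:
    print(has_noun_phrase(sentence))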
Upvotes: 1
Reputation: 488
NLTK supports POS tagging. You can first apply POS tagging to your sentences and then compare the resulting tag sequences with the predefined grammar.
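As a rough sketch of the comparison step, the snippet below assumes the sentences have already been tagged. The hand-written (word, tag) pairs are illustrative stand-ins for what nltk.pos_tag(nltk.word_tokenize(sentence)) would return, and the names tagged_sentences, TAG_PATTERN and conforms are made up for this example. It uses a plain Python regex over the tag string rather than an NLTK chunker, and it requires the whole sentence to match, which is stricter than checking whether a matching chunk occurs somewhere in the sentence.

```python
import re

# Hand-written (word, tag) pairs standing in for the output of
# nltk.pos_tag(nltk.word_tokenize(sentence)); the tags are assumed, not computed.
tagged_sentences = {
    "Mary had a little lamb.": [
        ("Mary", "NNP"), ("had", "VBD"), ("a", "DT"),
        ("little", "JJ"), ("lamb", "NN"), (".", "."),
    ],
    "John has a cute black pup.": [
        ("John", "NNP"), ("has", "VBZ"), ("a", "DT"), ("cute", "JJ"),
        ("black", "JJ"), ("pup", "NN"), (".", "."),
    ],
    "I ate five apples.": [
        ("I", "PRP"), ("ate", "VBD"), ("five", "CD"),
        ("apples", "NNS"), (".", "."),
    ],
}

# Plain-regex counterpart of <NNP><VBZ|VBD><DT><JJ>*<NN><.>, applied to a
# space-separated tag string such as "NNP VBD DT JJ NN .".
# Here the final <.> is treated as the literal "." punctuation tag.
TAG_PATTERN = re.compile(r"NNP (?:VBZ|VBD) DT (?:JJ )*NN \.")

def conforms(tagged):
    """Return True if the entire tag sequence matches TAG_PATTERN."""
    tags = " ".join(tag for _, tag in tagged)
    return TAG_PATTERN.fullmatch(tags) is not None

# Keep only the sentences whose tag sequence fits the pattern.
matching = [s for s, tagged in tagged_sentences.items() if conforms(tagged)]
print(matching)
# ['Mary had a little lamb.', 'John has a cute black pup.']
```

Because fullmatch is used, a sentence with extra material before or after the pattern (for example a leading adverb) is rejected; switching to search would accept it, which is closer to how a chunk found anywhere in the sentence behaves.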
Upvotes: 0