Reputation: 3263
I am trying to use the NLTK package to capture the following chunk in a sentence:
verb + smth + noun
or it may be
verb + smth + noun + and + noun
I honestly spent an entire day messing with regex, but I still haven't produced anything that works.
I was looking at this tutorial, which wasn't much help.
Upvotes: 2
Views: 62
Reputation: 2667
If you have an idea of what the "something" in between might be, there is a relatively easy method using NLTK's CFG. It is almost certainly not the most efficient way. For a comprehensive treatment, see chapter 8 of the NLTK book.
We have two patterns as you mentioned:
<verb> ... <noun>
<verb> ... <noun> "and" <noun>
We need to assemble a list of the verbs and nouns, plus the range of words that could occur in between. As a silly little example:
import nltk

grammar = nltk.CFG.fromstring("""
% start S
S -> VP SOMETHING NP
VP -> V
SOMETHING -> WORDS SOMETHING
SOMETHING ->
NP -> N 'and' N
NP -> N
V -> 'told' | 'scolded' | 'loved' | 'respected' | 'nominated' | 'rescued' | 'included'
N -> 'this' | 'us' | 'them' | 'you' | 'I' | 'me' | 'him'|'her'
WORDS -> 'among' | 'others' | 'not' | 'all' | 'of'| 'uhm' | '...' | 'let'| 'finish' | 'certainly' | 'maybe' | 'even' | 'me'
""")
Now suppose this is the list of the sentences we want to use our filter against:
sentences = [
    'scolded me and you',
    'included certainly uhm maybe even her and I',
    'loved me and maybe many others',
    'nominated others not even him',
    'told certainly among others uhm let me finish ... us and them',
    'rescued all of us',
    'rescued me and somebody else',
]
The third and the last phrases should not pass the filter, because they contain words ('many', 'somebody', 'else') that the grammar does not cover. We can check whether the rest match the pattern:
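def sentence_filter(sent, grammar):
    rd_parser = nltk.RecursiveDescentParser(grammar)
    try:
        # parse() raises ValueError if a token is not covered by the grammar
        matched = any(True for _ in rd_parser.parse(sent))
    except ValueError:
        matched = False
    # Report once per sentence, even when every word is known but no parse exists
    print("SUCCESS!" if matched else "Doesn't match the filter...")

for s in sentences:
    sentence_filter(s.split(), grammar)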
When we run this, we get this result:
>>>
SUCCESS!
SUCCESS!
Doesn't match the filter...
SUCCESS!
SUCCESS!
SUCCESS!
Doesn't match the filter...
>>>
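To make explicit what the grammar above accepts, here is a minimal plain-Python sketch of the same filter with no NLTK dependency. It is not part of NLTK: the names `VERBS`, `NOUNS`, `FILLERS` and `matches_pattern` are hypothetical, and the word sets are simply copied from the toy grammar. It assumes the pattern is exactly one verb, then any number of filler words, then either `N` or `N and N` at the end.

```python
# Word sets copied from the toy grammar; all names here are illustrative.
VERBS = {'told', 'scolded', 'loved', 'respected', 'nominated', 'rescued', 'included'}
NOUNS = {'this', 'us', 'them', 'you', 'I', 'me', 'him', 'her'}
FILLERS = {'among', 'others', 'not', 'all', 'of', 'uhm', '...', 'let',
           'finish', 'certainly', 'maybe', 'even', 'me'}


def matches_pattern(tokens):
    """Return True if tokens look like: verb + fillers* + (N | N 'and' N)."""
    if not tokens or tokens[0] not in VERBS:
        return False
    rest = tokens[1:]
    # Try both noun-phrase shapes at the end: "N and N" (3 tokens), then "N" (1 token).
    for np_len in (3, 1):
        if len(rest) < np_len:
            continue
        middle, np = rest[:-np_len], rest[-np_len:]
        if np_len == 3:
            np_ok = np[0] in NOUNS and np[1] == 'and' and np[2] in NOUNS
        else:
            np_ok = np[0] in NOUNS
        # Everything between the verb and the noun phrase must be a known filler.
        if np_ok and all(word in FILLERS for word in middle):
            return True
    return False


sentences = [
    'scolded me and you',
    'included certainly uhm maybe even her and I',
    'loved me and maybe many others',
    'nominated others not even him',
    'told certainly among others uhm let me finish ... us and them',
    'rescued all of us',
    'rescued me and somebody else',
]

results = [matches_pattern(s.split()) for s in sentences]
print(results)
# [True, True, False, True, True, True, False]
```

This gives the same accept/reject decisions as the CFG for these seven phrases. The CFG version is easier to extend once the pattern grows beyond a fixed verb-first, noun-last shape.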
Upvotes: 2