CromTheDestroyer

Reputation: 3766

Matching words with NLTK's chunk parser

NLTK's chunk parser's regular expressions can match POS tags, but can they also match specific words?
So, suppose I want to chunk any structure with a noun followed by the verb "left" (call this pattern L). For example, the sentence "the\DT dog\NN left\VB" should be chunked as
(S (DT the) (L (NN dog) (VB left))), but the sentence "the\DT dog\NN slept\VB" wouldn't be chunked at all.

I haven't been able to find any documentation on the chunking regex syntax, and all examples I've seen only match POS tags.

Upvotes: 8

Views: 2158

Answers (2)

Pratyush

Reputation: 5498

The chunk grammar only sees tags, not words, so the easiest way is to retag the words you care about: give each word you want to match its own tag, then refer to that tag in the regular expression.

Example:

import nltk

pos_tags = nltk.pos_tag(nltk.word_tokenize('Dog slept all night. Dog left at 8pm.'))

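# Give the words we want to match their own tag, since the chunk
# grammar matches tags only (compare case-insensitively so "Left" also counts)
pos_tags = [
    (w, 'LEFT') if w.lower() == 'left' else (w, t)
    for w, t in pos_tags
]

# A noun (NN, NNS, NNP, ...) immediately followed by the retagged word
grammar = "CHUNK: {<NN.*> <LEFT>}"
tree = nltk.RegexpParser(grammar).parse(pos_tags)
print(tree)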
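
Here is a rough sketch of the same retagging idea applied to the two sentences from the question. It uses hand-tagged tokens, so no tokenizer or tagger model is needed. The helper names `retag_word` and `restore_tags` are made up for this illustration and are not part of NLTK; `restore_tags` is an optional extra step that puts the original POS tags back after chunking, assuming the chunker keeps the tokens in their original order.

```python
import nltk


def retag_word(tagged, word, new_tag):
    """Return a copy of `tagged` where every occurrence of `word` gets `new_tag`."""
    return [(w, new_tag) if w.lower() == word.lower() else (w, t)
            for w, t in tagged]


def restore_tags(tree, original):
    """Rebuild `tree` with the (word, tag) pairs from `original` as leaves.

    Relies on the leaves appearing in the same left-to-right order
    as the tokens in `original`.
    """
    tokens = iter(original)

    def rebuild(node):
        if isinstance(node, nltk.Tree):
            return nltk.Tree(node.label(), [rebuild(child) for child in node])
        return next(tokens)

    return rebuild(tree)


def chunk_labels(tree):
    """Labels of the chunks directly under the root of `tree`."""
    return [child.label() for child in tree if isinstance(child, nltk.Tree)]


# Hand-tagged sentences from the question.
left = [("the", "DT"), ("dog", "NN"), ("left", "VB")]
slept = [("the", "DT"), ("dog", "NN"), ("slept", "VB")]

# Pattern L: a noun followed by the specially tagged word "left".
parser = nltk.RegexpParser("L: {<NN.*><LEFT>}")

left_tree = restore_tags(parser.parse(retag_word(left, "left", "LEFT")), left)
slept_tree = restore_tags(parser.parse(retag_word(slept, "left", "LEFT")), slept)

print(left_tree)
print(slept_tree)
```

Only the first sentence should end up with an L chunk; the second stays flat because "slept" keeps its VB tag and never matches `<LEFT>`.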

Upvotes: 1

Spaceghost

Reputation: 6995

I had a similar problem. After realizing that the regex pattern only examines tags, I changed the tag on the piece I was interested in.

For example, I was trying to match a product name and version. A chunk rule like <NNP>+<CD> worked for "Internet Explorer 8.0" but failed on "Internet Explorer 8.0 SP2", because the tagger labeled SP2 as NNP.

Perhaps I could have trained a POS tagger, but I decided instead to just change that tag to SP. A chunk rule like <NNP>+<CD><SP>* then matches either example.
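
Something along these lines, as a minimal sketch with hand-tagged tokens. The `SP\d+` pattern used to spot service-pack tokens and the `retag_service_packs` helper are assumptions made for this illustration, not what I actually ran.

```python
import re

import nltk


def retag_service_packs(tagged):
    """Give tokens that look like 'SP2' the custom tag 'SP' (hypothetical rule)."""
    return [(w, "SP") if re.fullmatch(r"SP\d+", w) else (w, t)
            for w, t in tagged]


# Tags as a POS tagger might plausibly assign them; SP2 comes out as NNP.
with_sp = [("Internet", "NNP"), ("Explorer", "NNP"), ("8.0", "CD"), ("SP2", "NNP")]
without_sp = [("Internet", "NNP"), ("Explorer", "NNP"), ("8.0", "CD")]

# One or more proper nouns, a number, then zero or more SP tokens.
parser = nltk.RegexpParser("PRODUCT: {<NNP>+<CD><SP>*}")

tree_with_sp = parser.parse(retag_service_packs(with_sp))
tree_without_sp = parser.parse(retag_service_packs(without_sp))

print(tree_with_sp)
print(tree_without_sp)
```

Both trees should contain a single PRODUCT chunk covering the whole phrase, with SP2 carrying the SP tag in the first one.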

Upvotes: 1
