Software Enthusiastic
Software Enthusiastic

Reputation: 26455

nltk custom tokenizer and tagger

Here is my requirement. I want to tokenize and tag a paragraph in such a way that it allows me to achieve following stuffs.

For example, following sentense

"They all like to go there on 5th November 2010, but I am not interested."

should be tagged and tokenized as follows in case of that custom phrase is "I am not interested".

[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), 
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), 
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]

Any suggestions would be useful.

Upvotes: 3

Views: 4846

Answers (2)

Neodawn
Neodawn

Reputation: 1096

You should probably do chunking with the nltk.RegexpParser to achieve your objective.

Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1

Upvotes: 2

Fred Foo
Fred Foo

Reputation: 363577

The proper answer is to compile a large dataset tagged in the way you want, then train a machine learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:

s = "They all like to go there on 5th November 2010, but I am not interested."

DATE = re.compile(r'^[1-9][0-9]?(th|st|rd)? (January|...)( [12][0-9][0-9][0-9])?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []
    date_found = False

    i = 0
    while i < len(tagged):
        (w,t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:          # end of date found
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []
            date_found = False
        elif date_found and i == len(tagged)-1:    # end of date found
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w,t)
                phrase = []

Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)

Upvotes: 7

Related Questions