Reputation: 26455

nltk custom tokenizer and tagger

I want to tokenize and tag a paragraph so that it meets the following requirements.

It should identify dates and times in the paragraph and tag them as DATE and TIME.
It should identify known phrases in the paragraph and tag them as CUSTOM.
The rest of the content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.

For example, the following sentence

"They all like to go there on 5th November 2010, but I am not interested."

should be tokenized and tagged as follows when the custom phrase is "I am not interested":

[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), 
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), 
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]

Any suggestions would be useful.

Upvotes: 3

Answers (2)

Neodawn

Reputation: 1096

You should probably do chunking with the nltk.RegexpParser to achieve your objective.

Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1

Upvotes: 2

Fred Foo

Reputation: 363577

The proper answer is to compile a large dataset tagged in the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:

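import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# "..." stands for the remaining month names. The month and year are optional
# so that a partial date such as "5th" still matches while the phrase grows.
DATE = re.compile(r'^[1-9][0-9]?(st|nd|rd|th)?( (January|...)( [12][0-9][0-9][0-9])?)?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []             # tokens of the date candidate being built
    date_found = False      # True once the candidate has matched DATE

    i = 0
    while i < len(tagged):
        w, t = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:          # current token ends the date
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []     # i is not advanced, so this token is re-examined
            date_found = False
        elif date_found and i == len(tagged)-1:    # date runs to the end of the input
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w, t)
                phrase = []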

To do: expand the DATE regular expression, add code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that contain only an ordinal number.)
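
Some of the open items above (searching for CUSTOM phrases, preferring the longest match, and not treating a lone ordinal as a date) could be sketched roughly as follows. This is only a sketch under assumptions, not a finished tagger: merge_phrases and max_len are hypothetical names, the month list is filled in for illustration, and the input is an already-tagged list of (word, tag) pairs of the kind pos_tag returns, written by hand here so the example runs without NLTK.

```python
import re

# Assumption: English month names written out in full.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# A complete date needs a day and a month, so a lone "5th" never matches.
DATE = re.compile(r"[1-9][0-9]?(st|nd|rd|th)? (%s)( [12][0-9]{3})?" % MONTHS)

def merge_phrases(tagged, custom_phrases, max_len=6):
    """Merge the longest DATE or CUSTOM span starting at each position.

    tagged: list of (word, tag) pairs, e.g. the output of nltk.pos_tag.
    custom_phrases: phrases whose tokens are separated by single spaces.
    max_len: longest span, in tokens, to try (hypothetical tuning knob).
    """
    custom = set(custom_phrases)
    words = [w for w, _ in tagged]
    out = []
    i = 0
    while i < len(tagged):
        # Try the longest candidate span first so the longest match wins.
        for n in range(min(max_len, len(words) - i), 0, -1):
            text = " ".join(words[i:i + n])
            if text in custom:
                out.append((text, "CUSTOM"))
                break
            if DATE.fullmatch(text):
                out.append((text, "DATE"))
                break
        else:
            # Nothing matched here: keep the tagger's own token and tag.
            out.append(tagged[i])
            n = 1
        i += n
    return out

# Hand-written tags standing in for pos_tag(word_tokenize(s)).
tagged = [('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'),
          ('go', 'VB'), ('there', 'RB'), ('on', 'IN'), ('5th', 'CD'),
          ('November', 'NNP'), ('2010', 'CD'), (',', ','), ('but', 'CC'),
          ('I', 'PRP'), ('am', 'VBP'), ('not', 'RB'), ('interested', 'JJ'),
          ('.', '.')]
print(merge_phrases(tagged, ["I am not interested"]))
```

Trying spans from longest to shortest is the simplest way to get the longest match, at the cost of up to max_len joins per token. Custom phrases have to be spelled the way word_tokenize splits them, so a phrase containing punctuation or a contraction would need its tokens separated accordingly.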

Upvotes: 7
