tim

Reputation: 3

Word tokenization NLTK abbreviation problem

I want to know how to word tokenize the following sentence (string):

"I am good. I e.g. wash the dishes."

Into the following words:

["I", "am", "good", ".", "I", "e.g.", "wash", "the", "dishes"]

Now, the problem is that when it comes to abbreviations like "e.g.", NLTK's word_tokenize splits them into ["e.g", "."].
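For reference, this is roughly how I am calling it (assuming the punkt data has been downloaded):

import nltk
# nltk.download('punkt')  # needed once for word_tokenize

tokens = nltk.word_tokenize("I am good. I e.g. wash the dishes.")
# "e.g." comes back split into "e.g" and "." instead of staying one token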

I tried using punkt trained with "e.g." to sentence tokenize it first, but I realised that after word tokenizing it I would get the same result.
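What I tried looks roughly like this (just a sketch; punkt expects abbreviations to be registered in lower case without the trailing dot):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# tell punkt that "e.g" (no trailing dot) is an abbreviation
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['e.g'])
sent_tokenizer = PunktSentenceTokenizer(punkt_param)

sentences = sent_tokenizer.tokenize("I am good. I e.g. wash the dishes.")
# sentence splitting now keeps "e.g." inside one sentence, but word
# tokenizing each sentence afterwards still separates "e.g" and "."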

Any thoughts on how I would achieve my goal?

Note: I am restricted to using NLTK.

Upvotes: 0

Views: 1387

Answers (1)

amanb

Reputation: 5473

The NLTK regexp_tokenize function splits a string into substrings using a regular expression: you define a regex pattern and the tokenizer returns its matches as tokens. For your particular use case we can write a pattern that looks for words, abbreviations (both upper and lower case) and standalone symbols like '.', ';' etc.

import nltk
sent = "I am good. I e.g. wash the dishes."
pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Za-z]\.)+       # abbreviations (both upper and lower case, like "e.g.", "U.S.A.")
        | \w+(?:-\w+)*        # words with optional internal hyphens 
        | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
nltk.regexp_tokenize(sent, pattern)
# Output:
['I', 'am', 'good', '.', 'I', 'e.g.', 'wash', 'the', 'dishes', '.']
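
If you need a reusable tokenizer object, the same pattern can equivalently be wrapped in a RegexpTokenizer instance (a small sketch reusing the pattern and sent defined above):

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(pattern)  # same verbose pattern as above
tokenizer.tokenize(sent)
# ['I', 'am', 'good', '.', 'I', 'e.g.', 'wash', 'the', 'dishes', '.']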

The regex pattern for abbreviations is (?:[A-Za-z]\.)+. It is a non-capturing group that matches one or more repetitions of a single letter ([A-Za-z]) followed by a literal dot (\.), so sequences such as "e.g." or "U.S.A." are kept together as single tokens.
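To see that sub-pattern in isolation, here is a small sketch with the standard re module (the string is just an example):

import re

abbrev = r'(?:[A-Za-z]\.)+'  # one or more letter-plus-dot pairs
re.findall(abbrev, "e.g. and U.S.A.")
# ['e.g.', 'U.S.A.']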

A full stop that is not part of such a letter-dot sequence, on the other hand, is matched as an independent token by the character class in the last alternative:

'[][.,;"'?():_`-]'
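
In isolation, that character class simply picks out single punctuation characters (again a small sketch with re; the string is just an example):

import re

symbols = r'''[][.,;"'?():_`-]'''  # each match is one punctuation character, including ] and [
re.findall(symbols, 'good. (really) "done";')
# ['.', '(', ')', '"', '"', ';']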

Upvotes: 0
