Reputation: 3
I want to know how to word tokenize the following sentence (string):
"I am good. I e.g. wash the dishes."
Into the following list of words:
["I", "am", "good", ".", "I", "e.g.", "wash", "the", "dishes"]
Now, the problem arises with abbreviations like "e.g.":
the NLTK word_tokenize function splits it into ["e.g", "."].
I tried using punkt trained with "e.g."
to sentence tokenize the string first, but I realised that word tokenizing it afterwards gives the same result.
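For reference, the punkt attempt above can be sketched roughly like this (my own illustration, not code from the question; note that PunktParameters stores abbreviations lower-cased and without the trailing period):

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Register "e.g." as a known abbreviation (stored without the trailing period).
params = PunktParameters()
params.abbrev_types = {'e.g'}
sent_tok = PunktSentenceTokenizer(params)

sents = sent_tok.tokenize("I am good. I e.g. wash the dishes.")
# sents: ['I am good.', 'I e.g. wash the dishes.']
```

This keeps "e.g." from ending a sentence, but word tokenizing each sentence afterwards still splits the abbreviation, which is the problem described above.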
Any thoughts on how I could achieve my goal?
Note: I am restricted to using NLTK.
Upvotes: 0
Views: 1387
Reputation: 5473
The NLTK regexp_tokenize function splits a string into substrings using a regular expression. We can define a regex pattern and build a tokenizer that matches the groups in that pattern. For your particular use case, the pattern below looks for words, abbreviations (both upper and lower case), and symbols like '.', ';', etc.
import nltk
sent = "I am good. I e.g. wash the dishes."
pattern = r'''(?x)        # set flag to allow verbose regexps
      (?:[A-Za-z]\.)+     # abbreviations (both upper and lower case, like "e.g.", "U.S.A.")
    | \w+(?:-\w+)*        # words with optional internal hyphens
    | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
'''
nltk.regexp_tokenize(sent, pattern)
# Output:
['I', 'am', 'good', '.', 'I', 'e.g.', 'wash', 'the', 'dishes', '.']
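The same pattern also works through the RegexpTokenizer class, which is convenient if you want a reusable tokenizer object. Here it is applied to a sentence of my own (not from the question) containing an upper-case abbreviation:

```python
from nltk.tokenize import RegexpTokenizer

pattern = r'''(?x)        # verbose regex
      (?:[A-Za-z]\.)+     # abbreviations like "e.g.", "U.S.A."
    | \w+(?:-\w+)*        # words with optional internal hyphens
    | [][.,;"'?():_`-]    # punctuation kept as separate tokens
'''

tokenizer = RegexpTokenizer(pattern)
tokens = tokenizer.tokenize("The U.S.A. is large; I e.g. travel a lot.")
# tokens: ['The', 'U.S.A.', 'is', 'large', ';', 'I', 'e.g.', 'travel', 'a', 'lot', '.']
```

Both "U.S.A." and "e.g." survive as single tokens because the abbreviation alternative is tried before the word and punctuation alternatives.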
The regex alternative for abbreviations is (?:[A-Za-z]\.)+
: a non-capturing group that matches one or more letter-plus-period pairs, so the \.
consumes the "."
as part of tokens like "e.g." and "U.S.A.".
A full stop that is not part of such a letter-period sequence falls through to the character class, where it is matched as an independent token:
'[][.,;"'?():_`-]'
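To see the abbreviation alternative in isolation, here is a quick check with the standard re module (my own illustration):

```python
import re

abbrev = re.compile(r'(?:[A-Za-z]\.)+')

# The letter-plus-period group repeats, so multi-part abbreviations match in full.
assert abbrev.fullmatch('e.g.') is not None      # matches
assert abbrev.fullmatch('U.S.A.') is not None    # matches
assert abbrev.fullmatch('etc') is None           # no trailing period, no match
```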
Upvotes: 0