Reputation: 5575
I am using nltk to split up sentences into words, e.g.
nltk.word_tokenize("The code didn't work!")
-> ['The', 'code', 'did', "n't", 'work', '!']
The tokenizing works well at splitting up word boundaries [i.e. splitting punctuation from words], but sometimes over-splits, and modifiers at the end of a word get treated as separate parts. For example, didn't gets split into the parts did and n't, and i've gets split into I and 've. Obviously this is because such words are split in two in the original corpus that nltk is using, and that may be desirable in some instances.
Is there any built-in way of overriding this behavior? Possibly in a similar manner to how nltk's MWETokenizer is able to aggregate multiple words into phrases, but in this case aggregating word components back into whole words, as in the sketch below. Alternatively, is there another tokenizer that does not split up word parts?
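To illustrate the kind of aggregation I mean, here is a rough post-processing sketch with MWETokenizer (the contraction pair and the empty separator are just for illustration; it requires listing every contraction up front, which is what I'd like to avoid):
import nltk
from nltk.tokenize import MWETokenizer
# re-merge known contraction pieces after word_tokenize has split them
merger = MWETokenizer([('did', "n't")], separator='')
merger.tokenize(nltk.word_tokenize("The code didn't work!"))
-> ['The', 'code', "didn't", 'work', '!']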
Upvotes: 22
Views: 15601
Reputation: 473903
This is actually working as expected: that is the correct/expected output. For word tokenization, contractions are considered two words because, meaning-wise, they are.
Different nltk tokenizers handle English contractions differently. For instance, I've found that TweetTokenizer does not split the contraction into two parts:
>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> tknzr.tokenize("The code didn't work!")
[u'The', u'code', u"didn't", u'work', u'!']
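If you want Treebank-style splitting everywhere except inside contractions, another workaround is a regex-based tokenizer. This is only a sketch and the pattern below is my own assumption, not something shipped with nltk:
>>> from nltk.tokenize import RegexpTokenizer
>>> # keep word-internal apostrophes attached to the word; punctuation stays separate
>>> tokenizer = RegexpTokenizer(r"\w+(?:'\w+)*|[^\w\s]")
>>> tokenizer.tokenize("The code didn't work!")
['The', 'code', "didn't", 'work', '!']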
Please see more information and workarounds at:
Upvotes: 35