Reputation: 955
I have the following string:
title = 'Tesla S&P Debut Comes All at Once'
When I tokenize this in NLTK, I get the following:
import nltk
token = nltk.word_tokenize(title)
token
['Tesla', 'S', '&', 'P', 'Debut', 'Comes', 'All', 'at', 'Once']
Tokenizing splits S&P into 'S', '&', 'P' because of the &.
How can I prevent NLTK from splitting on particular special characters?
Upvotes: 1
Views: 1338
Reputation: 86
You can use regexp_tokenize from NLTK, which lets you supply a regular expression that defines the separators:
from nltk import regexp_tokenize
title = 'Tesla S&P Debut Comes All at Once'
# With gaps=True the pattern matches the separators (whitespace or . , ; ')
# rather than the tokens themselves, so & is never a split point.
tokens = regexp_tokenize(title, pattern=r"\s|[\.,;']", gaps=True)
print(tokens)
['Tesla', 'S&P', 'Debut', 'Comes', 'All', 'at', 'Once']
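If you would rather keep NLTK's standard word_tokenize behavior and only protect specific multi-token sequences, MWETokenizer is another option. Here is a minimal sketch; the ('S', '&', 'P') entry is just this example's choice, and you would list whichever sequences you want kept together:
import nltk
from nltk.tokenize import MWETokenizer

title = 'Tesla S&P Debut Comes All at Once'
# Merge the token sequence S, &, P back into one token; separator=''
# joins the pieces with nothing between them.
mwe = MWETokenizer([('S', '&', 'P')], separator='')
tokens = mwe.tokenize(nltk.word_tokenize(title))
print(tokens)
# ['Tesla', 'S&P', 'Debut', 'Comes', 'All', 'at', 'Once']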
Upvotes: 3