IamWarmduscher

Reputation: 955

How do I ignore special characters when tokenizing in NLTK?

I have the following string:

title = 'Tesla S&P Debut Comes All at Once'

When I tokenize this in NLTK, I get the following:

import nltk

token = nltk.word_tokenize(title)
token
['Tesla', 'S', '&', 'P', 'Debut', 'Comes', 'All', 'at', 'Once']

word_tokenize splits S&P into separate tokens because of the &.

How can I prevent NLTK from splitting on particular special characters?

Upvotes: 1

Views: 1338

Answers (1)

djimab

Reputation: 86

You can use regexp_tokenize from NLTK, which lets you supply a regular expression defining the separators:

from nltk import regexp_tokenize

title = 'Tesla S&P Debut Comes All at Once'
# With gaps=True the pattern defines the separators: split on whitespace
# and on . , ; ' -- & is not in the pattern, so S&P stays intact
tokens = regexp_tokenize(title, pattern=r"\s|[.,;']", gaps=True)

print(tokens)

['Tesla', 'S&P', 'Debut', 'Comes', 'All', 'at', 'Once']
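
If you would rather keep word_tokenize's behaviour for everything else and only re-join a few known exceptions like S&P, a possible alternative is NLTK's MWETokenizer, which merges specified multi-word expressions back into single tokens. This is a minimal sketch; the ('S', '&', 'P') entry is an assumed list of the expressions you want to protect:

import nltk
from nltk.tokenize import MWETokenizer

# Re-join the token sequence ('S', '&', 'P') into one token,
# using an empty separator so it comes out as 'S&P'.
mwe = MWETokenizer([('S', '&', 'P')], separator='')

title = 'Tesla S&P Debut Comes All at Once'
tokens = mwe.tokenize(nltk.word_tokenize(title))

print(tokens)
# ['Tesla', 'S&P', 'Debut', 'Comes', 'All', 'at', 'Once']

This way you don't have to hand-write a separator regex, at the cost of listing each protected expression up front.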

Upvotes: 3
