How to update nltk package so that it does not break email into 3 different tokens?

Question

When I run the following code: tokens = word_tokenize("a@b.com")

the address is split into these 3 tokens: 'a', '@', 'b.com'

What I want is to keep it as a single token: 'a@b.com'.

Wiktor Stribiżew · Accepted Answer

DISCLAIMER: There are a lot of email regexps out there. I am not trying to match all email formats in this answer, just showing an example.

A regex approach with RegexpTokenizer (mentioned above by lenz) can work:

from nltk.tokenize.regexp import RegexpTokenizer

line = "My email: a@bc.com is not accessible."
# Email-like text is tried first, then words, then single punctuation chars
pattern = r'\S+@[^\s.]+\.[a-zA-Z]+|\w+|[^\w\s]'
tokeniser = RegexpTokenizer(pattern)
print(tokeniser.tokenize(line))
# => ['My', 'email', ':', 'a@bc.com', 'is', 'not', 'accessible', '.']
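The pattern can also be tried without NLTK installed. The sketch below assumes that RegexpTokenizer, with its default settings, collects matches much like the standard library's re.findall does; tokenize is a hypothetical helper name, not part of NLTK. It checks the input from the question as well as the sample sentence.

```python
import re

# Pattern copied from the answer: email-like text first, then words,
# then single punctuation characters.
PATTERN = r'\S+@[^\s.]+\.[a-zA-Z]+|\w+|[^\w\s]'

def tokenize(text):
    """Hypothetical helper: all non-overlapping matches of PATTERN, left to right."""
    return re.findall(PATTERN, text)

print(tokenize("a@b.com"))
# => ['a@b.com']
print(tokenize("My email: a@bc.com is not accessible."))
# => ['My', 'email', ':', 'a@bc.com', 'is', 'not', 'accessible', '.']
```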

The regex matches:

\S+@[^\s.]+\.[a-zA-Z]+ - text that looks like an email address:
- \S+ - 1 or more non-whitespace chars
- @ - a @ symbol
- [^\s.]+ - 1 or more chars other than whitespace and .
- \. - a literal dot
- [a-zA-Z]+ - 1 or more ASCII letters
| - or
\w+ - 1 or more word chars (letters, digits, or underscores)
| - or
[^\w\s] - a single char that is neither a word char nor whitespace (add + after the class to match a run of 1 or more such chars).
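A few consequences of this breakdown can be checked with the standard library's re module. This is a sketch, and the variable names are illustrative only. It shows that the email alternative has to come first, what adding + to the last alternative changes, and one case where this deliberately simple pattern falls short: a domain containing more than one dot.

```python
import re

EMAIL = r'\S+@[^\s.]+\.[a-zA-Z]+'

email_first = EMAIL + r'|\w+|[^\w\s]'      # order used in the answer
email_last = r'\w+|[^\w\s]|' + EMAIL       # same alternatives, email tried last
grouped_punct = EMAIL + r'|\w+|[^\w\s]+'   # "+" added to the punctuation class

# Alternatives are tried left to right, so the email branch must come first.
print(re.findall(email_first, "a@bc.com"))
# => ['a@bc.com']
print(re.findall(email_last, "a@bc.com"))
# => ['a', '@', 'bc', '.', 'com']

# With "+", runs of punctuation become one token instead of several.
print(re.findall(email_first, "Wait... what?!"))
# => ['Wait', '.', '.', '.', 'what', '?', '!']
print(re.findall(grouped_punct, "Wait... what?!"))
# => ['Wait', '...', 'what', '?!']

# Limitation: only one dot is allowed after the @, so a longer domain is cut short.
print(re.findall(email_first, "a@mail.example.com"))
# => ['a@mail.example', '.', 'com']
```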

