Reputation: 1091
When I run the following code:
tokens = word_tokenize("a@b.com")
the address gets broken into these 3 tokens: 'a', '@', 'b.com'
What I want is to keep it as a single token: 'a@b.com'.
Upvotes: 1
Views: 788
Reputation: 627087
DISCLAIMER: There are a lot of email regexps out there. I am not trying to match every email format in this answer, just showing an example.
A regex approach with RegexpTokenizer (mentioned above by lenz) can work:
from nltk.tokenize.regexp import RegexpTokenizer

line = "My email: [email protected] is not accessible."
pattern = r'\S+@[^\s.]+\.[a-zA-Z]+|\w+|[^\w\s]'
tokeniser = RegexpTokenizer(pattern)
tokeniser.tokenize(line)
# => ['My', 'email', ':', '[email protected]', 'is', 'not', 'accessible', '.']
The regex matches:
- \S+@[^\s.]+\.[a-zA-Z]+ - text looking like an email:
  - \S+ - 1 or more non-whitespace chars
  - @ - a @ symbol
  - [^\s.]+ - 1 or more chars other than whitespace and .
  - \. - a literal dot
  - [a-zA-Z]+ - 1 or more ASCII letters
- | - or
- \w+ - 1 or more word chars (letters, digits, or underscores)
- | - or
- [^\w\s] - a single occurrence of a char other than a word or whitespace char (add + after it to match a sequence of 1 or more).

See the online regex demo.
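To see how the three alternatives interact, here is a minimal sketch that applies the same pattern with the standard library's re.findall instead of NLTK. The helper name tokenize_keep_emails and its merge_punct flag are made up for this illustration. As far as I know, RegexpTokenizer in its default (non-gaps) mode behaves much like findall on a pattern without capturing groups, but treat that as an assumption rather than a guarantee.

```python
import re

# Same pattern as in the answer: email-like text first, then words, then punctuation.
EMAIL_FIRST = r'\S+@[^\s.]+\.[a-zA-Z]+|\w+|[^\w\s]'
# Variant with "+" on the last alternative, so runs of punctuation stay together.
EMAIL_FIRST_MERGED = r'\S+@[^\s.]+\.[a-zA-Z]+|\w+|[^\w\s]+'


def tokenize_keep_emails(text, merge_punct=False):
    """Hypothetical helper: split text into tokens, keeping email-like strings whole."""
    pattern = EMAIL_FIRST_MERGED if merge_punct else EMAIL_FIRST
    # The email alternative comes first, so it is tried before \w+ at every position.
    return re.findall(pattern, text)


print(tokenize_keep_emails("a@b.com"))
# ['a@b.com']
print(tokenize_keep_emails("Mail me: a@b.com!!"))
# ['Mail', 'me', ':', 'a@b.com', '!', '!']
print(tokenize_keep_emails("Mail me: a@b.com!!", merge_punct=True))
# ['Mail', 'me', ':', 'a@b.com', '!!']
```

The order of the alternatives matters: if \w+ came first, it would consume the "a" before the email branch got a chance to match the whole address.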
Upvotes: 1