Grimlock

Reputation: 1091

How to update the nltk package so that it does not break an email address into 3 different tokens?

When I run the following code: tokens = word_tokenize("a@b.com")

It gets broken into these 3 tokens: 'a' , '@' , 'b.com'

What I want is to keep it as a single token, 'a@b.com'.

Upvotes: 1

Views: 788

Answers (1)

Wiktor Stribiżew

Reputation: 627087

DISCLAIMER: There are a lot of email regexps out there. I am not trying to match all email formats in this question, just showing an example.

A regex approach with RegexpTokenizer (mentioned above by lenz) can work:

from nltk.tokenize.regexp import RegexpTokenizer
line="My email: [email protected] is not accessible."
pattern = r'\S+@[^\s.]+\.[a-zA-Z]+|\w+|[^\w\s]'
tokeniser=RegexpTokenizer(pattern)
tokeniser.tokenize(line)
# => ['My', 'email', ':', '[email protected]', 'is', 'not', 'accessible', '.']

The regex matches:

  • \S+@[^\s.]+\.[a-zA-Z]+ - text looking like email:
    • \S+ - 1 or more non-whitespace chars
    • @ - a @ symbol
    • [^\s.]+ - 1 or more chars other than whitespaces and .
    • \. - a literal dot
    • [a-zA-Z]+ - 1 or more ASCII letters
  • | - or
  • \w+ - 1 or more word chars (letters, digits, or underscores)
  • | - or
  • [^\w\s] - a single char that is neither a word char nor a whitespace char (add + after it to match a sequence of 1 or more such chars).

See the online regex demo.
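To try the pattern without installing NLTK, here is a minimal sketch using only the standard re module. It assumes that RegexpTokenizer with its default settings returns the successive non-overlapping matches of the pattern, which is what re.findall does here. The tokenize helper and the address user@example.com are made up for illustration.

```python
import re

# Same pattern as in the answer: email-like text first, then words,
# then any single non-word, non-whitespace character.
PATTERN = re.compile(r'\S+@[^\s.]+\.[a-zA-Z]+|\w+|[^\w\s]')


def tokenize(text):
    """Return all non-overlapping matches of PATTERN, left to right.

    Hypothetical helper: it mimics what the tokenizer is assumed to do
    with its default settings, it is not part of NLTK.
    """
    return PATTERN.findall(text)


# user@example.com is an illustrative address, not one from the question.
tokens = tokenize("My email: user@example.com is not accessible.")
print(tokens)
# ['My', 'email', ':', 'user@example.com', 'is', 'not', 'accessible', '.']

# The address from the question stays in one piece as well.
print(tokenize("a@b.com"))
# ['a@b.com']
```

The order of the alternatives matters: the email branch comes first, so it is tried before \w+ at every position and the address is not split at the @.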

Upvotes: 1
