Alex

Reputation: 4180

spaCy: custom infix regex rule to split on `:` for patterns like mailto:[email protected] is not applied consistently

With the default tokenizer, spaCy treats mailto:[email protected] as a single token.

I tried the following:

import spacy

nlp = spacy.load('en_core_web_lg')
# Add an infix rule that splits on ":" when it directly follows "mailto"
infixes = nlp.Defaults.infixes + (r'(?<=mailto):(?=\w+)', )
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

However, this custom rule is not applied consistently. For example, if I apply the tokenizer to mailto:[email protected], it does what I want:

nlp("mailto:[email protected]")
# [mailto, :, [email protected]]

However, if I apply the tokenizer to mailto:[email protected], it does not work as intended.

nlp("mailto:[email protected]")
# [mailto:[email protected]]

Is there a way to fix this inconsistency?

Upvotes: 1

Views: 400

Answers (1)

aab

Reputation: 11484

There's a tokenizer exception pattern for URLs, which matches things like mailto:[email protected] as one token. It knows that top-level domains have at least two letters so it matches gmail.co and gmail.com but not gmail.c.
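To illustrate the two-letter rule, here is a minimal sketch using a much-simplified stand-in for the URL pattern. The regex below is an assumption made for illustration and is not spaCy's actual URL_PATTERN, and the addresses are hypothetical placeholders.

```python
import re

# Simplified stand-in (NOT spaCy's real pattern): optional "mailto:" scheme,
# a user part, "@", one or more host labels, and a top-level domain that
# must contain at least two letters.
SIMPLE_URL = re.compile(
    r"^(?:mailto:)?[^\s@]+@(?:[a-z0-9-]+\.)+[a-z]{2,}$",
    re.IGNORECASE,
)

def is_single_token(text):
    """Return True if the stand-in pattern would claim the whole string."""
    return SIMPLE_URL.match(text) is not None

# Hypothetical example addresses
examples = ["mailto:user@gmail.com", "mailto:user@gmail.co", "mailto:user@gmail.c"]
results = {s: is_single_token(s) for s in examples}
```

Strings the pattern claims stay as one token, so the infix rule never gets a chance to run on them. A string with a one-letter domain ending is not claimed, so it falls through to the prefix, infix, and suffix rules and the custom split applies.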

You can override it by setting:

nlp.tokenizer.token_match = None

Then you should get:

[t.text for t in nlp("mailto:[email protected]")]
# ['mailto', ':', '[email protected]']

[t.text for t in nlp("mailto:[email protected]")]
# ['mailto', ':', '[email protected]']

If you want to keep the default URL tokenization for everything except mailto:, you could modify URL_PATTERN from lang/tokenizer_exceptions.py (see also how TOKEN_MATCH is defined right below it) and use the result instead of None.
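As an alternative to editing URL_PATTERN itself, one option might be to wrap the existing matcher so that it declines anything starting with mailto: and defers to the original otherwise. This is a sketch under assumptions: `without_mailto` and `demo_url_match` are hypothetical names, and the demo regex is a simplified stand-in rather than spaCy's pattern.

```python
import re

def without_mailto(base_match):
    """Wrap a token_match-style callable so it never claims 'mailto:' strings."""
    def token_match(text):
        if text.lower().startswith("mailto:"):
            # Decline the match so prefix/infix/suffix rules can split it
            return None
        return base_match(text)
    return token_match

# Hypothetical stand-in for the .match method of a compiled URL pattern
demo_url_match = re.compile(
    r"^(?:https?://|mailto:)\S+\.[a-z]{2,}$", re.IGNORECASE
).match

custom_match = without_mailto(demo_url_match)
```

With spaCy this would be applied as `nlp.tokenizer.token_match = without_mailto(nlp.tokenizer.token_match)`. Depending on the spaCy version, the URL pattern may be set on `nlp.tokenizer.url_match` rather than `token_match`, so it is worth checking which attribute is not None before wrapping it.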

Upvotes: 2
