Sergio

Reputation: 197

Spacy tokenizer with only "Whitespace" rule

I would like to know whether the spaCy tokenizer can tokenize words using only a whitespace rule. For example:

sentence = "(c/o Oxford University )"

Normally, using the following configuration of spacy:

nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(token)

the result would be:

 (
 c
 /
 o
 Oxford
 University
 )

Instead, I would like an output like the following (using spacy):

(c/o 
Oxford 
University
)

Is it possible to obtain a result like this using spacy?

Upvotes: 5

Views: 7072

Answers (3)

Denziloe

Reputation: 8162

According to the docs --

https://spacy.io/usage/spacy-101#annotations-token https://spacy.io/api/tokenizer

-- splitting on whitespace is the base behaviour of Tokenizer.

Thus, this simple solution should work:

import spacy    
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
tokenizer = Tokenizer(nlp.vocab)

There's a minor caveat. You didn't specify what should be done with multiple spaces. spaCy treats these as separate tokens, so that the exact original text can be recovered from the tokens. "hello  world" (with two spaces) will be tokenized as "hello", " ", "world". (With one space, it will of course just be "hello", "world".)
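For instance, a quick check on the sentence from the question (a minimal sketch reusing the blank pipeline above and assigning the tokenizer to nlp.tokenizer):

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = Tokenizer(nlp.vocab)  # whitespace-only splitting

print([t.text for t in nlp("(c/o Oxford University )")])
# ['(c/o', 'Oxford', 'University', ')']

# Per the caveat above, the extra space in "hello  world" becomes its own token
print([t.text for t in nlp("hello  world")])
# ['hello', ' ', 'world']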

Upvotes: 1

Sergey Bushmanov

Reputation: 25249

Let's replace nlp.tokenizer with a custom Tokenizer that uses a token_match regex:

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])

nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])

Before: [This, is, it, 's]
After : [This, is, it's]

You can further adjust Tokenizer by adding custom suffix, prefix, and infix rules.
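For example, here is a minimal sketch (assuming spaCy's spacy.util.compile_prefix_regex, compile_suffix_regex and compile_infix_regex helpers, and reusing nlp and Tokenizer from the snippet above) that hands affix patterns to a custom Tokenizer; the model's defaults are used here just to show the mechanics, and you can swap in your own compiled regexes:

from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

# Pass prefix/suffix/infix patterns to the custom Tokenizer
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
    suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
    infix_finditer=compile_infix_regex(nlp.Defaults.infixes).finditer,
)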

An alternative, more fine-grained approach is to find out why the it's token is split the way it is, using nlp.tokenizer.explain():

import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)

You'll find that the split is due to SPECIAL rules:

[('TOKEN', 'This'),
 ('TOKEN', 'is'),
 ('SPECIAL-1', 'it'),
 ('SPECIAL-2', "'s"),
 ('SUFFIX', '.'),
 ('SPECIAL-1', 'I'),
 ('SPECIAL-2', "'m"),
 ('TOKEN', 'fine')]

The tokenizer exceptions can then be filtered to remove "it's":

exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k: v for k, v in exceptions.items() if k != "it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules=filtered_exceptions)
[tok for tok in nlp(text)]

[This, is, it's., I, 'm, fine]

or remove split on apostrophe altogether:

filtered_exceptions = {k: v for k, v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules=filtered_exceptions)
[tok for tok in nlp(text)]

[This, is, it's., I'm, fine]

Note the dot attached to "it's.", which is because no suffix rules were specified when building the Tokenizer.
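To split that dot off while still keeping the contractions together, a suffix rule can be passed to the Tokenizer. A minimal sketch, using a hypothetical single-character punctuation regex for illustration rather than the model's full default suffix rules:

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"

exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k: v for k, v in exceptions.items() if "'" not in k}

# Hypothetical minimal suffix rule: split off one trailing punctuation mark
suffix_re = re.compile(r"[.,!?;:]$")

nlp.tokenizer = Tokenizer(nlp.vocab, rules=filtered_exceptions,
                          suffix_search=suffix_re.search)
[tok for tok in nlp(text)]

[This, is, it's, ., I'm, fine]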

Upvotes: 11

Sofie VL

Reputation: 3106

You can find the solution to this very question in the spaCy docs: https://spacy.io/usage/linguistic-features#custom-tokenizer-example. In a nutshell, you create a callable that takes a string of text and returns a Doc object, and then assign it to nlp.tokenizer:

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])
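With the space-based split above, this should print something like:

["What's", 'happened', 'to', 'me?', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.']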

Upvotes: 5
