amatsuo_net

Reputation: 2448

Keeping all whitespace as tokens

I have a question about whether there is a way to keep single whitespaces as independent tokens in spaCy's tokenization.

For example, if I run:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is easy.")
toks = [w.text for w in doc]
toks

The result is:

['This', 'is', 'easy', '.']

Instead, I would like to have something like:

['This', ' ', 'is', ' ', 'easy', '.']

Is there a simple way to do that?

Upvotes: 3

Views: 3889

Answers (2)

Jacques Gaudin

Reputation: 16958

If you want the whitespaces as tokens in the Doc object, you can use a custom tokenizer:

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # Interleave the words with single-space tokens: even indices
        # hold the words, odd indices hold ' '.
        res = [' '] * (2 * len(words) - 1)
        res[::2] = words
        return Doc(self.vocab, words=res)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("This is easy.")
print([t.text for t in doc])
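
Note that this replaces spaCy's tokenizer entirely and only splits on single spaces, so punctuation is no longer split off: the printed result is ['This', ' ', 'is', ' ', 'easy.'], with 'easy.' kept as one token.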

Upvotes: 3

Ines Montani

Reputation: 7105

spaCy exposes each token's trailing whitespace as the whitespace_ attribute. So if you only need a list of strings, you could do:

token_texts = []
for token in doc:
    token_texts.append(token.text)
    if token.whitespace_:  # filter out empty strings
        token_texts.append(token.whitespace_)
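
For the example sentence in the question, this produces ['This', ' ', 'is', ' ', 'easy', '.'].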

If you want to create an actual Doc object out of those tokens, that's possible, too. Doc objects can be constructed with a words keyword argument (a list of strings to add as tokens). However, I'm not sure how useful that would be.
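
As a minimal sketch, reusing the nlp object and the token_texts list built above: the spaces keyword argument (a list of booleans, one per token) marks that no token carries trailing whitespace, since the spaces are themselves tokens.

from spacy.tokens import Doc

# Build a Doc directly from the interleaved strings. spaces=False for
# every token, because the whitespace is represented as tokens rather
# than as trailing whitespace.
doc2 = Doc(nlp.vocab, words=token_texts, spaces=[False] * len(token_texts))
print([t.text for t in doc2])  # ['This', ' ', 'is', ' ', 'easy', '.']

Concatenating the tokens then reproduces the original text, "This is easy.".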

Upvotes: 10
