Reputation: 2448
I have a question about whether there is a way to keep a single whitespace as an independent token in spaCy tokenization.
For example, if I run:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is easy.")
toks = [w.text for w in doc]
toks
The result is
['This', 'is', 'easy', '.']
Instead, I would like to have something like
['This', ' ', 'is', ' ', 'easy', '.']
Is there a simple way to do that?
Upvotes: 3
Views: 3889
Reputation: 16958
If you want the whitespace tokens in the Doc object itself, you can replace the default tokenizer:
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # Interleave the words with single-space tokens:
        # even indices hold the words, odd indices hold ' '.
        res = [' '] * (2 * len(words) - 1)
        res[::2] = words
        # Mark every token as not followed by a space, so that
        # doc.text reproduces the original string (the spaces
        # are already tokens of their own).
        return Doc(self.vocab, words=res, spaces=[False] * len(res))

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("This is easy.")
print([t.text for t in doc])
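Note that a plain text.split(' ') keeps punctuation attached to the preceding word, so for the sentence from the question this prints ['This', ' ', 'is', ' ', 'easy.'] with the period still glued to 'easy'; you would need extra handling if you also want punctuation split off as its own token.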
Upvotes: 3
Reputation: 7105
spaCy exposes each token's trailing whitespace as the whitespace_ attribute. So if you only need a list of strings, you could do:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is easy.")

token_texts = []
for token in doc:
    token_texts.append(token.text)
    if token.whitespace_:  # filter out empty strings
        token_texts.append(token.whitespace_)
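For the sentence from the question this produces exactly the list you asked for, ['This', ' ', 'is', ' ', 'easy', '.'], since neither 'easy' nor the final '.' is followed by whitespace.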
If you want to create an actual Doc object out of those tokens, that's possible, too. Doc objects can be constructed with a words keyword argument (a list of strings to add as tokens). However, I'm not sure how useful that would be.
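As a minimal sketch (reusing the token_texts list built above; the variable name doc_ws is just for illustration), you can pass spaces=False for every token so that doc.text round-trips to the original string:

from spacy.tokens import Doc

# Each whitespace is already its own token, so no token is
# followed by an additional space.
spaces = [False] * len(token_texts)
doc_ws = Doc(nlp.vocab, words=token_texts, spaces=spaces)

print([t.text for t in doc_ws])  # ['This', ' ', 'is', ' ', 'easy', '.']
print(doc_ws.text)               # 'This is easy.'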
Upvotes: 10