Reputation: 473
I'm stuck on a problem here. I'm going to use spacy's word tokenizer, but I have a constraint, e.g. that my tokenizer shouldn't split words that contain apostrophes (').
Example:
Input string: "I can't do this"
Current output: ["I", "ca", "n't", "do", "this"]
Expected output: ["I", "can't", "do", "this"]
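For reference, a minimal snippet that reproduces the default behaviour (assuming the en_core_web_sm model is installed):
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; other English pipelines split the contraction the same way
print([t.text for t in nlp("I can't do this")])
# ['I', 'ca', "n't", 'do', 'this']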
What I've tried:
doc = nlp(sent)
# indices of tokens that contain an apostrophe (skipping the first token)
position = [token.i for token in doc if token.i != 0 and "'" in token.text]
with doc.retokenize() as retokenizer:
    for pos in position:
        # merge each apostrophe token with the token right before it
        retokenizer.merge(doc[pos - 1:pos + 1])
for token in doc:
    print(token.text)
This way I get the expected output, but I don't know whether this approach is right, or whether there is a better way to do the retokenization.
Upvotes: 2
Views: 4062
Reputation: 11474
The retokenizer approach works, but the simpler way is to modify the tokenizer so it doesn't split these words in the first place. The contractions with apostrophes that are split like this (don't, can't, I'm, you'll, etc.) are handled by tokenizer exceptions.
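For example, you can look up one of these contractions in the default exception table (a quick sketch; the exact contents of the entry depend on your spacy version):
import spacy

nlp = spacy.load("en_core_web_sm")
# the default English exceptions include entries for contractions such as "can't",
# describing the pieces ("ca" and "n't") the tokenizer splits them into
print(nlp.Defaults.tokenizer_exceptions.get("can't"))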
With spacy v2.2.3, you can inspect and set tokenizer exceptions with the property nlp.tokenizer.rules. To remove the exceptions with any kind of apostrophe:
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer.rules = {key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key and "’" not in key and "‘" not in key}
assert [t.text for t in nlp("can't")] == ["can't"]
Be aware that the default models (tagger, parser, NER) provided by spacy for English won't work as well on texts with this tokenization because they're trained on data with the contractions split.
With older versions of spacy, you'll have to create a custom tokenizer and pass in a modified rules= after modifying nlp.Defaults.tokenizer_exceptions. Use all the other existing settings (nlp.tokenizer.prefix_search / suffix_search / infix_finditer / token_match) to keep the existing tokenization in all other cases. A sketch of what that could look like for spacy v2.x is below.
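This is only a rough sketch (the variable names are mine; the Tokenizer arguments mirror the settings listed above):
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')

# drop the default exceptions that contain any kind of apostrophe
rules = {key: value for key, value in nlp.Defaults.tokenizer_exceptions.items()
         if "'" not in key and "’" not in key and "‘" not in key}

# rebuild the tokenizer with the filtered rules, keeping all other settings
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=rules,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=nlp.tokenizer.token_match,
)
assert [t.text for t in nlp("can't")] == ["can't"]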
Upvotes: 6