Reputation: 13
I want to tokenize bs-it to ["bs","it"] using spaCy, as I am using it with Rasa. The output I get is ["bs-it"]. Can somebody help me with that?
Upvotes: 1
Views: 857
Reputation: 1442
You can add custom rules to spaCy's tokenizer. In your setup, the tokenizer is keeping the hyphenated word as a single token. To change that, you can add a custom tokenization rule. In your case, you want to split on an infix, i.e. something that occurs between two words; these are usually hyphens or underscores.
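import re
import spacy
from spacy.tokenizer import Tokenizer

# Treat a hyphen as an infix, i.e. a split point inside a token.
infix_re = re.compile(r"-")

def custom_tokenizer(nlp):
    # This tokenizer only has the infix rule; the default prefix,
    # suffix and exception rules are not carried over.
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("bs-it")
print([t.text for t in doc])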
Output
['bs', '-', 'it']
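Note that the hyphen comes out as its own token, so this is not yet the ["bs","it"] you asked for. To show what the infix pattern is doing, here is a rough sketch in plain Python that cuts a string at every match of the same infix_re. The helper split_on_infixes and its keep_infix flag are made up for illustration; this is a simplified stand-in, not how spaCy implements it internally.

```python
import re

# Same pattern as in the answer: a hyphen is an infix (split point).
infix_re = re.compile(r"-")

def split_on_infixes(text, keep_infix=True):
    """Hypothetical helper: cut text at every infix match."""
    pieces = []
    start = 0
    for match in infix_re.finditer(text):
        # Text before the infix becomes its own piece.
        if match.start() > start:
            pieces.append(text[start:match.start()])
        # Optionally keep the infix itself as a separate piece.
        if keep_infix:
            pieces.append(match.group())
        start = match.end()
    # Whatever follows the last infix is the final piece.
    if start < len(text):
        pieces.append(text[start:])
    return pieces

with_hyphen = split_on_infixes("bs-it")
without_hyphen = split_on_infixes("bs-it", keep_infix=False)
print(with_hyphen)
print(without_hyphen)
```

With the spaCy doc from the answer, the equivalent of keep_infix=False would be to filter the hyphen tokens out afterwards, for example [t.text for t in doc if t.text != "-"].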
Upvotes: 1