How to tokenize a word with a hyphen in spaCy

Question

I want to tokenize bs-it into ["bs","it"] with spaCy, which I am using together with Rasa. The output I currently get is ["bs-it"]. Can somebody help me with that?

Raqib · Accepted Answer

You can add custom rules to spaCy's tokenizer. If the tokenizer is keeping a hyphenated word such as bs-it as a single token, you can change that by adding a custom tokenization rule. In your case, the rule needs to split on an infix, i.e. a character that occurs inside a token between two words, which is usually a hyphen or an underscore.

import re
import spacy
from spacy.tokenizer import Tokenizer

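# Match a hyphen anywhere inside a token so it is split off as an infix.
infix_re = re.compile(r'[-]')

def custom_tokenizer(nlp):
    # Note: a Tokenizer built with only infix_finditer has no prefix, suffix,
    # or special-case rules, so other punctuation stays attached to words.
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)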

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("bs-it")
print([t.text for t in doc])

Output

['bs', '-', 'it']
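
The output above still contains the hyphen as its own token, while the question asks for ["bs","it"]. Here is a minimal sketch of one way to get there, assuming it is acceptable to filter the hyphen out after tokenization. It extends spaCy's default infix patterns instead of replacing the whole tokenizer, so the default prefix, suffix, and special-case rules stay in place. Using spacy.blank("en") is an assumption made only to avoid a model download; the same lines should also work on a pipeline returned by spacy.load.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank English pipeline is enough here, because only the
# rule-based tokenizer is needed, not a trained model.
nlp = spacy.blank("en")

# Extend the default infix patterns with a plain hyphen instead of
# replacing the tokenizer, so the other default rules are kept.
infixes = list(nlp.Defaults.infixes) + [r"-"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("bs-it")

# The hyphen is still produced as a token; drop it when collecting the text.
tokens = [t.text for t in doc if t.text != "-"]
print(tokens)
```

For the input from the question this should print ['bs', 'it']. Note that the filter only changes the list you build: the Doc itself still contains the hyphen token, which matters if a downstream component such as Rasa reads the Doc directly.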
