Reputation: 683
I am using spaCy to tokenize a string, and the string is likely to contain a specific substring. If the substring is present, I would like spaCy to treat the substring as a token, regardless of any other rules it has. I would like to keep all other rules intact. Is this possible?
To provide a concrete example, suppose the substring of interest is 'banana'; I want 'I like bananabread.' to be tokenized as ['I', 'like', 'banana', 'bread', '.'].
Where do I go from here (keeping in mind that I would like to keep the rest of the tokenizer rules intact)? I have tried adding 'banana' to the prefixes, suffixes, and infixes, with no success.
Upvotes: 1
Views: 634
Reputation: 11474
Adding the string as a prefix, suffix, and infix should work, but depending on which version of spaCy you're using, you may have run into a caching bug while testing. The bug is fixed in v2.2+.
With spaCy v2.3.2:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I like bananabread."

# default tokenization keeps 'bananabread' as a single token
assert [t.text for t in nlp(text)] == ['I', 'like', 'bananabread', '.']

# prepend 'banana' to the default prefix, suffix, and infix patterns,
# so all of the existing rules stay intact
prefixes = ("banana",) + nlp.Defaults.prefixes
suffixes = ("banana",) + nlp.Defaults.suffixes
infixes = ("banana",) + nlp.Defaults.infixes

# recompile the regexes and assign them to the tokenizer
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.prefix_search = prefix_regex.search
nlp.tokenizer.suffix_search = suffix_regex.search
nlp.tokenizer.infix_finditer = infix_regex.finditer

# 'banana' is now split off wherever it appears
assert [t.text for t in nlp(text)] == ['I', 'like', 'banana', 'bread', '.']
(In v2.1 or earlier, the tokenizer customization still works on a newly loaded nlp, but if you had already processed some texts with the nlp pipeline and then modified the settings, the bug was that the tokenizer reused the stored tokenization from its cache rather than applying the new settings.)
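If you are stuck on v2.1 or earlier, the practical workaround is to apply the customization on a freshly loaded pipeline before processing any text, so the cache never holds tokenizations produced by the old rules. A minimal sketch, reusing the same model and 'banana' example as above:
import spacy

# load a fresh pipeline so the tokenizer cache is still empty
nlp = spacy.load("en_core_web_sm")

# apply the custom prefix/suffix/infix settings *before* calling nlp() on any text
prefixes = ("banana",) + nlp.Defaults.prefixes
suffixes = ("banana",) + nlp.Defaults.suffixes
infixes = ("banana",) + nlp.Defaults.infixes
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

# only now run texts through the pipeline
print([t.text for t in nlp("I like bananabread.")])
# ['I', 'like', 'banana', 'bread', '.']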
Upvotes: 4
Reputation: 2126
Tokenization occurs at the beginning of the spaCy pipeline, so you should preprocess the text first.
I've written a function that uses regular expressions to pad substrings in compound words:
import re

text = 'I eat bananas and bananabread at the bookstore.'

def separate_compound_toks(text):
    anti_compound = sorted(['banana', 'store'])
    anti_compound = "|".join(t.lower() for t in anti_compound)
    # pad word from end
    pattern_a = re.compile(r'(?i)({sub})(?=[a-z]{{3,}})'.format(sub=anti_compound))
    text = re.sub(pattern_a, r'\1 ', text)
    # pad word from beginning
    pattern_b = re.compile(r'(?i)(?<![^a-z])({sub})'.format(sub=anti_compound))
    text = re.sub(pattern_b, r' \1', text)
    return text
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(separate_compound_toks(text))
print([tok.text for tok in doc])
# ['I', 'eat', 'bananas', 'and', 'banana', 'bread', 'at', 'the', 'book', 'store', '.']
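The same preprocessing should also handle the example from the question, since it reuses the separate_compound_toks function and nlp pipeline defined above:
doc = nlp(separate_compound_toks('I like bananabread.'))
print([tok.text for tok in doc])
# ['I', 'like', 'banana', 'bread', '.']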
Upvotes: 1