BioBroo

Reputation: 683

How can I add a specific substring to tokenize on in spaCy?

I am using spaCy to tokenize a string that is likely to contain a specific substring. If the substring is present, I would like spaCy to treat it as its own token, regardless of any other rules, while keeping all other tokenization rules intact. Is this possible?

To provide a concrete example, suppose the substring of interest is 'banana'; I want 'I like bananabread.' to be tokenized as ['I', 'like', 'banana', 'bread', '.'].

I have tried adding 'banana' to the prefixes, suffixes, and infixes, with no success (see the sketch below). Where do I go from here, keeping in mind that I would like to keep the rest of the tokenizer rules intact?
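
The question doesn't show the exact code, but the attempt was presumably along these lines (a hypothetical reconstruction, not the asker's actual code; the model name and variable names are illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")

# presumed attempt: prepend 'banana' to the default patterns and recompile
prefixes = ("banana",) + nlp.Defaults.prefixes
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search
# ... and similarly for suffixes and infixes ...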

Upvotes: 1

Views: 634

Answers (2)

aab

Reputation: 11474

Adding the string as a prefix, suffix, and infix should work, but depending on which version of spaCy you're using, you may have run into a caching bug while testing. That bug is fixed in v2.2+.

With spaCy v2.3.2:

import spacy
nlp = spacy.load("en_core_web_sm")

text = "I like bananabread."
# the default tokenizer keeps the compound intact
assert [t.text for t in nlp(text)] == ['I', 'like', 'bananabread', '.']

# prepend "banana" to the default prefix, suffix, and infix patterns
prefixes = ("banana",) + nlp.Defaults.prefixes
suffixes = ("banana",) + nlp.Defaults.suffixes
infixes = ("banana",) + nlp.Defaults.infixes

# recompile the regexes and swap them into the existing tokenizer
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
infix_regex = spacy.util.compile_infix_regex(infixes)

nlp.tokenizer.prefix_search = prefix_regex.search
nlp.tokenizer.suffix_search = suffix_regex.search
nlp.tokenizer.infix_finditer = infix_regex.finditer

assert [t.text for t in nlp(text)] == ['I', 'like', 'banana', 'bread', '.']

(In v2.1 or earlier, the tokenizer customization still works on a newly loaded nlp. The bug was that if you had already processed texts with the pipeline and then modified the settings, the tokenizer would reuse the tokenizations stored in its cache rather than apply the new settings.)
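
So on v2.1 or earlier, the workaround is to customize a freshly loaded pipeline before processing any text, so the cache is only ever filled under the new settings. A minimal sketch of the ordering (same API as above):

import spacy
nlp = spacy.load("en_core_web_sm")

# customize FIRST, before any call to nlp(...), so the tokenizer cache
# is only ever populated with the new settings
prefixes = ("banana",) + nlp.Defaults.prefixes
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search
# ... same for suffixes and infixes, as above ...

doc = nlp("I like bananabread.")  # first text processed after the change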

Upvotes: 4

thorntonc

Reputation: 2126

Tokenization occurs at the beginning of the spaCy pipeline, so one option is to preprocess the text before it reaches the tokenizer.

I've written a function that uses regular expressions to insert spaces around the target substrings inside compound words:

import re

text = 'I eat bananas and bananabread at the bookstore.'

def separate_compound_toks(text):
    # substrings to split out of compound words
    anti_compound = sorted(['banana', 'store'])
    anti_compound = "|".join(t.lower() for t in anti_compound)
    # pad word from end: add a space after the substring when it is followed
    # by 3+ letters, so plain plurals like "bananas" are left intact
    pattern_a = re.compile(r'(?i)({sub})(?=[a-z]{{3,}})'.format(sub=anti_compound))
    text = re.sub(pattern_a, r'\1 ', text)
    # pad word from beginning: add a space before the substring when it is
    # preceded by another letter, e.g. "bookstore" -> "book store"
    pattern_b = re.compile(r'(?i)(?<![^a-z])({sub})'.format(sub=anti_compound))
    text = re.sub(pattern_b, r' \1', text)
    return text


import spacy
nlp = spacy.load("en_core_web_sm")

# pad the compounds before the text reaches the tokenizer
doc = nlp(separate_compound_toks(text))
print([tok.text for tok in doc])
# ['I', 'eat', 'bananas', 'and', 'banana', 'bread', 'at', 'the', 'book', 'store', '.']

Upvotes: 1
