Reputation: 685
Good day SO,
I am trying to post-process hyphenated words that are tokenized into separate tokens when they should have been a single token.
Example:
Sentence: "up-scaled"
Tokens: ['up', '-', 'scaled']
Expected: ['up-scaled']
For now, my solution is to use the matcher:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')

matcher = Matcher(nlp.vocab)
pattern = [{'IS_ALPHA': True, 'IS_SPACE': False},
           {'ORTH': '-'},
           {'IS_ALPHA': True, 'IS_SPACE': False}]
matcher.add('HYPHENATED', None, pattern)

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    # print(doc)
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer
doc = nlp(text)
However, this causes an unwanted issue, shown below:
Example 2:
Sentence: "I know I will be back - I had a very pleasant time"
Tokens: ['i', 'know', 'I', 'will', 'be', 'back - I', 'had', 'a', 'very', 'pleasant', 'time']
Expected: ['i', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']
Is there a way to process only words separated by hyphens that have no spaces between the characters, so that a word like 'up-scaled' is matched and merged into a single token, but '.. back - I ..' is not?
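For illustration, here is a rough sketch of the kind of check I have in mind (my own assumption, reusing the matcher and callback above): skip matches where the first word or the hyphen is followed by whitespace, which can be tested with Token.whitespace_:

def quote_merger(doc):
    matched_spans = []
    for match_id, start, end in matcher(doc):
        # 'up-scaled' -> 'up' and '-' carry no trailing whitespace, so merge;
        # 'back - I'  -> 'back' and '-' are followed by spaces, so skip
        if doc[start].whitespace_ == '' and doc[start + 1].whitespace_ == '':
            matched_spans.append(doc[start:end])
    for span in matched_spans:
        span.merge()
    return doc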
Thank you very much
EDIT: I have tried the solution posted: Why does spaCy not preserve intra-word-hyphens during tokenization like Stanford CoreNLP does?
However, I didn't use this solution because it resulted in wrong tokenization of words with apostrophes (') and numbers with decimals:
Sentence: "It's"
Tokens: ["I", "t's"]
Expected: ["It", "'s"]
Sentence: "1.50"
Tokens: ["1", ".", "50"]
Expected: ["1.50"]
That is why I used Matcher instead of trying to edit the regex.
Upvotes: 15
Views: 6519
Reputation: 19
If nlp = spacy.load('en') throws an error, use nlp = spacy.load("en_core_web_sm") instead.
Upvotes: 0
Reputation: 11494
The Matcher is not really the right tool for this. You should modify the tokenizer instead.
If you want to preserve how everything else is handled and only change the behavior for hyphens, you should modify the existing infix patterns and preserve all the other settings. The current English infix patterns are defined in spacy/lang/punctuation.py.
You can add new patterns without defining a custom tokenizer, but there's no way to remove a pattern without defining a custom tokenizer. So, if you comment out the hyphen pattern and define a custom tokenizer:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            # the pattern that splits on hyphens between letters, commented out:
            # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )

    infix_re = compile_infix_regex(infixes)

    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)
nlp = spacy.load("en")
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("It's 1.50, up-scaled haven't")])
# ['It', "'s", "'", '1.50', "'", ',', 'up-scaled', 'have', "n't"]
You do need to provide the current prefix/suffix/token_match settings when initializing the new Tokenizer to preserve the existing tokenizer behavior. See also (for German, but very similar): https://stackoverflow.com/a/57304882/461847
Edited to add (since this does seem unnecessarily complicated and you really should be able to redefine the infix patterns without loading a whole new custom tokenizer):
If you have just loaded the model (for v2.1.8) and you haven't called nlp() yet, you can also just replace the infix_re.finditer without creating a custom tokenizer:
nlp = spacy.load('en')
nlp.tokenizer.infix_finditer = infix_re.finditer
There's a caching bug that should hopefully be fixed in v2.2 that will let this work correctly at any point rather than just with a newly loaded model. (The behavior is extremely confusing otherwise, which is why creating a custom tokenizer has been a better general-purpose recommendation for v2.1.8.)
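For completeness, a minimal sketch of that direct replacement, reusing the same infix list as in custom_tokenizer() above (hyphen pattern omitted) and assuming a freshly loaded model on v2.1.8 because of the caching bug just mentioned:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# same infixes as above, with the hyphen-between-letters pattern left out
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)

nlp = spacy.load('en')   # freshly loaded; nlp() has not been called yet
nlp.tokenizer.infix_finditer = infix_re.finditer
print([t.text for t in nlp("It's 1.50, up-scaled haven't")])
# ['It', "'s", '1.50', ',', 'up-scaled', 'have', "n't"]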
Upvotes: 12