SpaCy NER doesn't seem to correctly recognize hyphenated names

Question

First things first: I'm new to spaCy and have just started testing it. I have to say that I'm impressed by its simplicity and the quality of the documentation. Thanks!

Now, I'm trying to identify PER entities in a French text. It works well in most cases, but I see a recurring error: names containing a hyphen are not recognized correctly (e.g. Pierre-Louis Durand comes out as two PER entities: "Pierre" and "Louis Durand").

See example:

import spacy

# nlp = spacy.load('fr')
nlp = spacy.load('fr_core_news_md')

description = ('C\'est Jean-Sébastien Durand qui leur a dit. Pierre Dupond n\'est pas venu à Boston comme attendu. '
    'Louis-Jean s\'est trompé. Claire a bien choisi.')

text = nlp(description)
labels = set([w.label_ for w in text.ents])
for label in labels:
    entities = [e.string for e in text.ents if label==e.label_]
    entities = list(entities)
    print(label, entities)

The output is:

LOC ['Boston ']
PER ['Jean', 'Sébastien Durand ', 'Pierre Dupond ', 'Louis', 'Jean ', 'Claire ']

The PER entities should include "Jean-Sébastien Durand" and "Louis-Jean" as single names.

I'm not sure what to do here:

- change the way tokens are extracted (I'm wondering about the side effects for non-PER entities); I don't think this is the issue, as a PER can be an aggregation of multiple tokens
- apply a magic setting somewhere so that hyphens can be used in NER for PER
- train the model
- go back to school ;-)

Thanks for your help (and yes I'm investigating by reading more, I love it)!

-TC

aab · Accepted Answer

I initially thought this would be a mismatch between the tokenizer and the training data, but it's actually a problem with how the regex that handles some words with hyphens is loaded from the saved model.
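To illustrate why that regex matters, here is a toy sketch in plain Python (it is not spaCy's implementation). It assumes a simplified tokenizer that splits on whitespace and then on hyphens, unless a `token_match` callable accepts the whole chunk. The names `toy_tokenize` and `TOKEN_MATCH` and the pattern itself are hypothetical; spaCy's real French pattern is far more elaborate.

```python
import re

# Hypothetical stand-in for a token_match pattern: capitalized words
# joined by hyphens. This is an assumption made for illustration only.
TOKEN_MATCH = re.compile(
    r"^[A-ZÀ-Ý][a-zà-ÿ]+(?:-[A-ZÀ-Ý][a-zà-ÿ]+)+$"
).match


def toy_tokenize(text, token_match=None):
    """Split on whitespace, then on hyphens unless token_match accepts the chunk."""
    tokens = []
    for chunk in text.split():
        if token_match is not None and token_match(chunk):
            # The whole chunk is protected and stays a single token.
            tokens.append(chunk)
        else:
            # Split around hyphens, keeping each hyphen as its own token.
            tokens.extend(part for part in re.split(r"(-)", chunk) if part)
    return tokens


# Without a working token_match, the name is broken into pieces.
print(toy_tokenize("Jean-Sébastien Durand"))
# With it, the hyphenated first name survives as one token.
print(toy_tokenize("Jean-Sébastien Durand", token_match=TOKEN_MATCH))
```

When the name is split into several tokens, the entity recognizer sees a sequence it was probably not trained on, which would explain why "Jean" and "Sébastien Durand" come out as separate entities.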

A temporary fix for spaCy v2.2 models (which you have to apply every time you load a French model) is to replace the problematic tokenizer setting with the correct default setting:

import spacy
from spacy.lang.fr import French

nlp = spacy.load("fr_core_news_md")
nlp.tokenizer.token_match = French.Defaults.token_match

description = ('C\'est Jean-Sébastien Durand qui leur a dit. Pierre Dupond n\'est pas venu à Boston comme attendu. '
    'Louis-Jean s\'est trompé. Claire a bien choisi.')

doc = nlp(description)
# Collect the distinct entity labels, then print the entities for each one
labels = {ent.label_ for ent in doc.ents}
for label in labels:
    entities = [ent.text for ent in doc.ents if ent.label_ == label]
    print(label, entities)

Output:

PER ['Jean-Sébastien Durand', 'Pierre Dupond']
LOC ['Boston', 'Louis-Jean', 'Claire']

(The French NER model is trained on data from Wikipedia, so it still doesn't do very well on the entity types for this particular text.)
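As a side note, the per-label loop in the snippets above rescans the entity list once for every label. A minimal sketch of grouping in a single pass follows. The helper name `group_by_label` is hypothetical, and it assumes the entities are passed as plain `(text, label)` pairs, which could be built with something like `[(ent.text, ent.label_) for ent in doc.ents]`. The sample pairs are copied from the output above.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (text, label) pairs into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Sample pairs taken from the output shown above.
pairs = [
    ("Jean-Sébastien Durand", "PER"),
    ("Pierre Dupond", "PER"),
    ("Boston", "LOC"),
    ("Louis-Jean", "LOC"),
    ("Claire", "LOC"),
]

# Sorting the labels makes the print order deterministic,
# unlike iterating over a set of labels.
for label, entities in sorted(group_by_label(pairs).items()):
    print(label, entities)
```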
