TC iO

Reputation: 23

spaCy NER doesn't seem to correctly recognize hyphenated names

First things first: I'm new to spaCy and just started to test it. I have to say that I'm impressed by its simplicity and the quality of the docs. Thanks!

Now, I'm trying to identify PER entities in a French text. It works pretty well for most names, but I noticed a recurring error: names with a hyphen are not recognized correctly (e.g. Pierre-Louis Durand comes out as two PER entities: "Pierre" and "Louis Durand").

See example:

import spacy

# nlp = spacy.load('fr')
nlp = spacy.load('fr_core_news_md')

description = ('C\'est Jean-Sébastien Durand qui leur a dit. Pierre Dupond n\'est pas venu à Boston comme attendu. '
    'Louis-Jean s\'est trompé. Claire a bien choisi.')

text = nlp(description)
labels = set([w.label_ for w in text.ents])
for label in labels:
    entities = [e.string for e in text.ents if label==e.label_]
    entities = list(entities)
    print(label, entities)

The output is:

LOC ['Boston ']
PER ['Jean', 'Sébastien Durand ', 'Pierre Dupond ', 'Louis', 'Jean ', 'Claire ']

The PER list should contain "Jean-Sébastien Durand" and "Louis-Jean" as single entities.

I'm not sure what to do here:

  1. change the way tokens are extracted (I'm wondering about the side effects for non-PER entities) - I don't think this is the issue, as a PER can span multiple tokens
  2. apply a magic setting somewhere so that hyphens are allowed inside PER entities
  3. train the model
  4. go back to school ;-)

Thanks for your help (and yes I'm investigating by reading more, I love it)!

-TC

Upvotes: 2

Views: 1138

Answers (1)

aab

Reputation: 11484

I initially thought this would be a mismatch between the tokenizer and the training data, but it's actually a problem with how the regex that handles some words with hyphens is loaded from the saved model.

A temporary fix for spaCy v2.2 models, which you have to apply every time after loading a French model, is to replace the problematic tokenizer setting with the correct default:

import spacy
from spacy.lang.fr import French

nlp = spacy.load("fr_core_news_md")
# Overwrite the token_match loaded from the saved model with the French default
nlp.tokenizer.token_match = French.Defaults.token_match

description = ('C\'est Jean-Sébastien Durand qui leur a dit. Pierre Dupond n\'est pas venu à Boston comme attendu. '
    'Louis-Jean s\'est trompé. Claire a bien choisi.')

text = nlp(description)
labels = set([w.label_ for w in text.ents])
for label in labels:
    entities = [e.text for e in text.ents if label==e.label_]
    entities = list(entities)
    print(label, entities)

Output:

PER ['Jean-Sébastien Durand', 'Pierre Dupond']
LOC ['Boston', 'Louis-Jean', 'Claire']

(The French NER model is trained on data from Wikipedia, so it still doesn't do very well on the entity types for this particular text: here it labels "Louis-Jean" and "Claire" as LOC.)
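
As a rough illustration of why token_match matters, here is a toy sketch in plain Python. It is not spaCy's tokenizer: toy_tokenize, HYPHENATED_WORD and INFIX are hypothetical names made up for this example, and the regex is a simplified stand-in, not the pattern spaCy ships for French. The idea it tries to show is that a hyphen is normally treated as a split point, unless a token_match function accepts the whole string and protects it from being split.

```python
import re

# Hypothetical, simplified stand-in for a token_match pattern: letters joined
# by single hyphens (e.g. "Jean-Sébastien"). NOT spaCy's actual French regex.
HYPHENATED_WORD = re.compile(r"^[^\W\d_]+(?:-[^\W\d_]+)+$")

# Infix rule of the toy tokenizer: split on hyphens, keeping them as tokens.
INFIX = re.compile(r"(-)")


def toy_tokenize(text, token_match=None):
    """Split on whitespace, then split each chunk on hyphens unless
    token_match accepts the whole chunk."""
    tokens = []
    for chunk in text.split():
        if token_match is not None and token_match(chunk):
            tokens.append(chunk)  # protected: kept as a single token
        else:
            tokens.extend(part for part in INFIX.split(chunk) if part)
    return tokens


sentence = "Jean-Sébastien Durand et Louis-Jean"
without_match = toy_tokenize(sentence)
with_match = toy_tokenize(sentence, token_match=HYPHENATED_WORD.match)
print(without_match)
print(with_match)
```

Without a working token_match, the toy tokenizer yields "Jean", "-", "Sébastien" as separate tokens, which is consistent with the split entities in the question. With it, the hyphenated names stay whole, so an NER model has a chance to label them as one entity.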

Upvotes: 1
