Reputation: 23
First things first: I'm new to spaCy and have just started testing it. I have to say that I'm impressed by its simplicity and the quality of the docs. Thanks!
Now, I'm trying to identify persons (PER) in a French text. It seems to work pretty well for most names, but I noticed a recurring error: names with a hyphen are not recognized correctly (e.g. Pierre-Louis Durand comes out as two PER entities: "Pierre" and "Louis Durand").
See example:
import spacy
# nlp = spacy.load('fr')
nlp = spacy.load('fr_core_news_md')
description = ('C\'est Jean-Sébastien Durand qui leur a dit. Pierre Dupond n\'est pas venu à Boston comme attendu. '
               'Louis-Jean s\'est trompé. Claire a bien choisi.')
text = nlp(description)
labels = set([w.label_ for w in text.ents])
for label in labels:
    entities = [e.string for e in text.ents if label==e.label_]
    entities = list(entities)
    print(label, entities)
The output is:
LOC ['Boston ']
PER ['Jean', 'Sébastien Durand ', 'Pierre Dupond ', 'Louis', 'Jean ', 'Claire ']
It should be: "Jean-Sébastien Durand" and "Louis-Jean".
I'm not sure what to do here.
Thanks for your help (and yes I'm investigating by reading more, I love it)!
-TC
Upvotes: 2
Views: 1138
Reputation: 11484
I initially thought this would be a mismatch between the tokenizer and the training data, but it's actually a problem with how the regex that handles some words with hyphens is loaded from the saved model.
A temporary fix for spaCy v2.2 models (which you have to apply every time after loading a French model) is to replace the problematic tokenizer setting with the correct default setting:
import spacy
from spacy.lang.fr import French
nlp = spacy.load("fr_core_news_md")
# Replace the token_match loaded from the saved model with the French default
nlp.tokenizer.token_match = French.Defaults.token_match
description = ('C\'est Jean-Sébastien Durand qui leur a dit. Pierre Dupond n\'est pas venu à Boston comme attendu. '
               'Louis-Jean s\'est trompé. Claire a bien choisi.')
text = nlp(description)
labels = set([w.label_ for w in text.ents])
for label in labels:
    entities = [e.text for e in text.ents if label==e.label_]
    entities = list(entities)
    print(label, entities)
Output:
PER ['Jean-Sébastien Durand', 'Pierre Dupond']
LOC ['Boston', 'Louis-Jean', 'Claire']
(The French NER model is trained on data from Wikipedia, so it still doesn't do very well on the entity types for this particular text.)
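To make the role of token_match a bit more concrete, here is a toy sketch in plain Python. It is not spaCy's actual tokenizer: the names toy_tokenize and HYPHENATED are made up for illustration, and the regex is a deliberately simplified assumption (spaCy's real French pattern is more elaborate). It only shows the general idea that a chunk matched by the regex is kept as one token, while anything else gets split at the hyphen.

```python
import re

# Hypothetical, simplified pattern: two capitalized parts joined by a hyphen.
# This is an assumption for illustration, not spaCy's French token_match.
HYPHENATED = re.compile(r"^[A-ZÀ-Ý][a-zà-ÿ]+-[A-ZÀ-Ý][a-zà-ÿ]+$")


def toy_tokenize(text, token_match=None):
    """Split on whitespace, then split each chunk on hyphens
    unless token_match reports that the chunk is a single token."""
    tokens = []
    for chunk in text.split():
        if token_match is not None and token_match(chunk):
            # The pattern matched: keep the hyphenated chunk whole.
            tokens.append(chunk)
        else:
            # No match: the hyphen becomes a token of its own.
            tokens.extend(part for part in re.split(r"(-)", chunk) if part)
    return tokens


sentence = "Jean-Sébastien Durand et Louis-Jean"
without_match = toy_tokenize(sentence)
with_match = toy_tokenize(sentence, token_match=HYPHENATED.match)
print(without_match)  # hyphenated names come out in pieces
print(with_match)     # hyphenated names stay whole
```

If the names are already split at this stage, the NER component never sees "Jean-Sébastien" as a single token, which matches the broken output in the question.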
Upvotes: 1