runningwild

Reputation: 148

How to perform NER on true case, then lemmatization on lower case, with spaCy

I am trying to lemmatize a text using spaCy 2.0.12 with the French model fr_core_news_sm. Moreover, I want to replace people's names with an arbitrary sequence of characters, detecting such names using token.ent_type_ == 'PER'. An example outcome would be "Pierre aime les chiens" -> "~PER~ aimer chien".

The problem is I can't find a way to do both. I only have these two partial options:

My idea would be to perform the standard pipeline (tagger, parser, NER), then lowercase, and then lemmatize only at the end.

However, lemmatization doesn't seem to have its own pipeline component, and the documentation doesn't explain how and where it is performed. This answer seems to imply that lemmatization is performed independently of any pipeline component and possibly at different stages of the pipeline.

So my question is: how can I choose when lemmatization is performed, and which input it is given?

Upvotes: 0

Views: 991

Answers (1)

aab

Reputation: 11474

If you can, use the most recent version of spaCy instead. The French lemmatizer has been improved a lot in 2.1.

If you have to use 2.0, consider using an alternate lemmatizer like this one: https://spacy.io/universe/project/spacy-lefff
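Whichever lemmatizer you end up with, the masking step can be kept separate from it. Below is a minimal sketch, not taken from the spaCy documentation: the names mask_and_lemmatize, lemma_of and run_with_spacy are hypothetical, and dropping stop words is an assumption made only to match the "~PER~ aimer chien" example in the question. NER runs on the true-cased text, and lowercasing happens only at the point where a lemma is looked up.

```python
from collections import namedtuple


def mask_and_lemmatize(tokens, lemma_of=None, placeholder="~PER~", skip_stop=True):
    """Replace PER tokens by a placeholder and lemmatize the other tokens.

    tokens:   iterable of objects exposing .text, .ent_type_, .lemma_ and
              (optionally) .is_stop; spaCy Token objects have these attributes.
    lemma_of: optional callable that maps a lowercased word to its lemma
              (a hook for an alternate lemmatizer). When None, the lemma
              already stored on the token is used.
    """
    out = []
    for tok in tokens:
        if tok.ent_type_ == "PER":
            # Entity labels come from the true-cased pass, so names are masked
            # before any lowercasing takes place.
            out.append(placeholder)
        elif skip_stop and getattr(tok, "is_stop", False):
            # Assumption: stop words are dropped, as in the question's example.
            continue
        elif lemma_of is not None:
            out.append(lemma_of(tok.text.lower()))
        else:
            out.append(tok.lemma_.lower())
    return " ".join(out)


def run_with_spacy(text, model="fr_core_news_sm"):
    """Run the helper on a real spaCy Doc (requires spaCy and the model)."""
    import spacy  # imported lazily so the helper above has no hard dependency

    nlp = spacy.load(model)
    doc = nlp(text)  # true-cased input for the tagger, parser and NER
    return mask_and_lemmatize(doc)


# Stand-in tokens with hand-written attribute values, so the logic can be
# checked without downloading a model. These values are illustrative only.
FakeToken = namedtuple("FakeToken", "text ent_type_ lemma_ is_stop")
demo = [
    FakeToken("Pierre", "PER", "Pierre", False),
    FakeToken("aime", "", "aimer", False),
    FakeToken("les", "", "le", True),
    FakeToken("chiens", "", "chien", False),
]
print(mask_and_lemmatize(demo))  # ~PER~ aimer chien
```

With a real model, the entity labels, lemmas and stop-word flags come from spaCy, so the result of run_with_spacy("Pierre aime les chiens") depends on the model version and may differ from the stand-in demo.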

Upvotes: 1
