Reputation: 4417
I wanted to use the lemmatizer for German in spaCy, but I am very surprised by the results:
import spacy
nlp = spacy.load("de_dep_news_trf")
[token.lemma_ for token in nlp('ich du er sie mein dein sein ihr unser')]
gives
['ich', 'du', 'ich', 'ich', 'meinen', 'mein', 'mein', 'mein', 'sich']
and I am not sure I can use that:
vielen dank für deinen sehr guten tweet
becomes
viel danken für mein sehr gut tweet
which clearly changes the meaning of the sentence.
Is that expected? Am I missing a tuning or configuration option that would make the lemmatizer less "aggressive"?
Upvotes: 4
Views: 867
Reputation: 11494
The current (v3.1) default German lemmatizer is just not very good. It's a very simple lookup lemmatizer with some questionable entries in its lookup table, but given the license constraints for the German pretrained pipelines, there haven't been other good alternatives. (We do have some internal work in progress on a statistical lemmatizer, but I'm not sure when it will make it into a release.)
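You can verify that it is pure lookup (with no knobs to tune) by inspecting the pipeline yourself. A minimal sketch; the table name "lemma_lookup" and the expected outputs are assumptions based on spaCy's lookup-lemmatizer conventions, not guaranteed for this pipeline:
import spacy

nlp = spacy.load("de_dep_news_trf")
lemmatizer = nlp.get_pipe("lemmatizer")

# A lookup lemmatizer has no statistical model; it just maps
# surface forms to lemmas via a static table.
print(lemmatizer.mode)  # expected: "lookup" (assumption)

# "lemma_lookup" is the conventional table name (assumption);
# this shows the raw entry behind e.g. sie -> ich.
table = lemmatizer.lookups.get_table("lemma_lookup")
print(table.get("sie"))
Since the lemma comes straight from that table, the only way to change the behavior is to edit the table or swap in a different lemmatizer.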
If lemmas are important for your task, the best suggestion is to use a different lemmatizer. Depending on your task, size, speed, and license requirements, you could consider a German model from spacy-stanza, or a third-party library like spacy-iwnlp (currently only for spaCy v2, but it's probably not hard to update it for v3). A sketch for the spacy-stanza route follows below.
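A minimal sketch of the spacy-stanza route, assuming the stanza and spacy-stanza packages are installed; the exact lemmas you get back depend on the Stanza German model:
import stanza
import spacy_stanza

# Download the Stanza German model once, then load it
# as a spaCy-compatible pipeline.
stanza.download("de")
nlp = spacy_stanza.load_pipeline("de")

doc = nlp("vielen dank für deinen sehr guten tweet")
print([token.lemma_ for token in doc])
Stanza's lemmatizer is statistical rather than lookup-based, so it should not collapse the German pronouns the way the lookup table does, at the cost of a heavier pipeline.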
Upvotes: 3