ddvlamin

Reputation: 181

spaCy lemmatization inconsistency with the lemma_lookup table

There seems to be an inconsistency between lemmatizing tokens while iterating over a spaCy document and looking up the lemma of the same word directly in the Vocab's lemma_lookup table.

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("I'm running faster")
for tok in doc:
    print(tok.lemma_)

This prints "faster" as the lemma for the token "faster" instead of "fast". However, the token does exist in the lemma_lookup table:

nlp.vocab.lookups.get_table("lemma_lookup")["faster"]

which outputs "fast"

Am I doing something wrong, or is there another reason why these two differ? Maybe my understanding is off and I'm comparing apples to oranges?

I'm using the following versions on Ubuntu Linux: spacy==2.2.4, spacy-lookups-data==0.1.0

Upvotes: 2

Views: 1407

Answers (2)

BehemothTheCat

Reputation: 3

aab wrote that:

The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.

This is also how I understood it from the spaCy code, but since I wanted to add my own dictionaries to improve the lemmatization of the pretrained models, I tried out the following, which worked:

import spacy

# load the model
nlp = spacy.load('es_core_news_lg')
# dictionary: key = correct lemma, value = list of token forms to map to it
# (the table is looked up by exact token text, so capitalized variants must be listed too)
corr_es = {
    "decir": ["dixo", "decia", "Dixo", "Decia"],
    "ir": ["iba", "Iba"],
    "parecer": ["parecia", "Parecia"],
    "poder": ["podia", "Podia"],
    "ser": ["fuesse", "Fuesse"],
    "haber": ["habia", "havia", "Habia", "Havia"],
    "ahora": ["aora", "Aora"],
    "estar": ["estàn", "Estàn"],
    "lujo": ["luxo", "luxar", "Luxo", "Luxar"],
    "razón": ["razon", "razòn", "Razon", "Razòn"],
    "caballero": ["cavallero", "Cavallero"],
    "mujer": ["muger", "mugeres", "Muger", "Mugeres"],
    "vez": ["vèz", "Vèz"],
    "jamás": ["jamas", "Jamas"],
    "demás": ["demas", "demàs", "Demas", "Demàs"],
    "cuidar": ["cuydado", "Cuydado"],
    "posible": ["possible", "Possible"],
    "comedia": ["comediar", "Comedias"],
    "poeta": ["poetas", "Poetas"],
    "mano": ["manir", "Manir"],
    "barba": ["barbar", "Barbar"],
    "idea": ["ideo", "Ideo"]
}
# overwrite the lemma for each token form in the lookup table
lookup_table = nlp.vocab.lookups.get_table("lemma_lookup")
for lemma, tokens in corr_es.items():
    for token in tokens:
        lookup_table[token] = lemma
# process the text (text is your input string)
doc = nlp(text)
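As a quick sanity check, the updated table should now return the corrected lemma for one of the forms above (entry chosen from the dictionary):

# verify the override took effect in the lookup table
print(nlp.vocab.lookups.get_table("lemma_lookup")["dixo"])  # decir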

Hopefully this helps.

Upvotes: 0

aab

Reputation: 11474

With a model like en_core_web_lg that includes a tagger and rules for a rule-based lemmatizer, it provides the rule-based lemmas rather than the lookup lemmas when POS tags are available to use with the rules. The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.
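You can confirm that such a model ships with a tagger by inspecting the pipeline (a minimal check; component names as shipped with spaCy v2.2 models):

import spacy

nlp = spacy.load("en_core_web_lg")
# the tagger supplies the POS tags that the rule-based lemmatizer needs
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']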

With faster, the POS tag is ADV, which is left as-is by the rules. If it had been tagged as ADJ, the lemma would be fast with the current rules.
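A short sketch to observe this, calling the rule-based lemmatizer directly with a different POS tag (assuming the v2 Lemmatizer exposed via vocab.morphology):

import spacy

nlp = spacy.load("en_core_web_lg")
tok = nlp("I'm running faster")[-1]
print(tok.pos_, tok.lemma_)  # ADV faster: ADV is left as-is by the rules

# ask the rule-based lemmatizer what it would produce for an ADJ tag instead
lemmatizer = nlp.vocab.morphology.lemmatizer
print(lemmatizer("faster", "ADJ"))  # ['fast'] with the current rules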

The lemmatizer tries to provide the best lemmas it can without requiring the user to manage any settings, but it's also not very configurable right now (v2.2). If you want to run the tagger but have lookup lemmas, you'll have to replace the lemmas after running the tagger.
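A minimal sketch of that replacement, overwriting each token's lemma from the lookup table after the pipeline has run (table access as shown in the question; tokens missing from the table keep their rule-based lemma):

import spacy

nlp = spacy.load("en_core_web_lg")
lookup = nlp.vocab.lookups.get_table("lemma_lookup")

doc = nlp("I'm running faster")
for tok in doc:
    # overwrite with the lookup lemma, falling back to the rule-based one
    tok.lemma_ = lookup.get(tok.text, tok.lemma_)

print([tok.lemma_ for tok in doc])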

Upvotes: 1
