ddvlamin

Reputation: 181

spaCy lemmatization inconsistency with the lemma_lookup table

There seems to be an inconsistency between lemmatizing tokens while iterating over a spaCy document and looking up the lemma of the same word directly in the Vocab's lemma_lookup table.

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("I'm running faster")
for tok in doc:
    print(tok.lemma_)

This prints "faster" as the lemma for the token "faster" instead of "fast". However, the token does exist in the lemma_lookup table:

nlp.vocab.lookups.get_table("lemma_lookup")["faster"]

which outputs "fast"

Am I doing something wrong, or is there another reason why these two differ? Maybe my understanding is off and I'm comparing apples to oranges?

I'm using the following versions on Ubuntu Linux: spacy==2.2.4, spacy-lookups-data==0.1.0

Upvotes: 2

Views: 1407

Answers (2)

BehemothTheCat

Reputation: 3

aab wrote that:

The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.

This is also how I understood it from the spaCy code, but since I wanted to add my own dictionaries to improve the lemmatization of the pretrained models, I tried out the following, which worked:

import spacy

# load the model
nlp = spacy.load('es_core_news_lg')
# dictionary: key = correct lemma, value = list of token forms to map to it
# (the table is looked up by exact token text, so capitalized variants must be listed too)
corr_es = {
    "decir": ["dixo", "decia", "Dixo", "Decia"],
    "ir": ["iba", "Iba"],
    "parecer": ["parecia", "Parecia"],
    "poder": ["podia", "Podia"],
    "ser": ["fuesse", "Fuesse"],
    "haber": ["habia", "havia", "Habia", "Havia"],
    "ahora": ["aora", "Aora"],
    "estar": ["estàn", "Estàn"],
    "lujo": ["luxo", "luxar", "Luxo", "Luxar"],
    "razón": ["razon", "razòn", "Razon", "Razòn"],
    "caballero": ["cavallero", "Cavallero"],
    "mujer": ["muger", "mugeres", "Muger", "Mugeres"],
    "vez": ["vèz", "Vèz"],
    "jamás": ["jamas", "Jamas"],
    "demás": ["demas", "demàs", "Demas", "Demàs"],
    "cuidar": ["cuydado", "Cuydado"],
    "posible": ["possible", "Possible"],
    "comedia": ["comediar", "Comedias"],
    "poeta": ["poetas", "Poetas"],
    "mano": ["manir", "Manir"],
    "barba": ["barbar", "Barbar"],
    "idea": ["ideo", "Ideo"]
}
# overwrite the lemma for each token form in the lookup table
lookup_table = nlp.vocab.lookups.get_table("lemma_lookup")
for lemma, tokens in corr_es.items():
    for token in tokens:
        lookup_table[token] = lemma
# process the text (text is your input string)
doc = nlp(text)
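As a quick sanity check, the updated table should now return the corrected lemma for one of the forms above (entry chosen from the dictionary):

# verify the override took effect in the lookup table
print(nlp.vocab.lookups.get_table("lemma_lookup")["dixo"])  # decir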

Hopefully this helps.

Upvotes: 0

aab

Reputation: 11474

With a model like en_core_web_lg that includes a tagger and rules for a rule-based lemmatizer, it provides the rule-based lemmas rather than the lookup lemmas when POS tags are available to use with the rules. The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.
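You can confirm that such a model ships with a tagger by inspecting the pipeline (a minimal check; component names as shipped with spaCy v2.2 models):

import spacy

nlp = spacy.load("en_core_web_lg")
# the tagger supplies the POS tags that the rule-based lemmatizer needs
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']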

With faster, the POS tag is ADV, which is left as-is by the rules. If it had been tagged as ADJ, the lemma would be fast with the current rules.
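A short sketch to observe this, calling the rule-based lemmatizer directly with a different POS tag (assuming the v2 Lemmatizer exposed via vocab.morphology):

import spacy

nlp = spacy.load("en_core_web_lg")
tok = nlp("I'm running faster")[-1]
print(tok.pos_, tok.lemma_)  # ADV faster: ADV is left as-is by the rules

# ask the rule-based lemmatizer what it would produce for an ADJ tag instead
lemmatizer = nlp.vocab.morphology.lemmatizer
print(lemmatizer("faster", "ADJ"))  # ['fast'] with the current rules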

The lemmatizer tries to provide the best lemmas it can without requiring the user to manage any settings, but it's also not very configurable right now (v2.2). If you want to run the tagger but have lookup lemmas, you'll have to replace the lemmas after running the tagger.
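A minimal sketch of that replacement, overwriting each token's lemma from the lookup table after the pipeline has run (table access as shown in the question; tokens missing from the table keep their rule-based lemma):

import spacy

nlp = spacy.load("en_core_web_lg")
lookup = nlp.vocab.lookups.get_table("lemma_lookup")

doc = nlp("I'm running faster")
for tok in doc:
    # overwrite with the lookup lemma, falling back to the rule-based one
    tok.lemma_ = lookup.get(tok.text, tok.lemma_)

print([tok.lemma_ for tok in doc])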

Upvotes: 1
