hou2zi0

Reputation: 71

Extending Lemma Lookup Table in Spacy

I am currently processing texts with the NLP library spaCy. spaCy, however, does not lemmatize all words correctly, so I want to extend its lookup table. Currently I merge spaCy's constant lookup table with my extension and then overwrite spaCy's native lookup table with the result.

I have the feeling, however, that this approach may not be the best or most consistent one.

Question: Is there another way to update the lookup table in spaCy, e.g. an update or extend function? I have read the docs and could not find anything like that. Or is this approach "just fine"?

Working example of my current approach:

import spacy

nlp = spacy.load('de')  # German model, spaCy v2.x

# Merge the custom entries into spaCy's constant lookup table,
# then write the merged table back
spacy_lookup = spacy.lang.de.LOOKUP
new_lookup = {'AAA': 'Anonyme Affen Allianz', 'BBB': 'Berliner Bauern Bund', 'CCC': 'Chaos Chaoten Club'}
spacy_lookup.update(new_lookup)
spacy.lang.de.LOOKUP = spacy_lookup

tagged = nlp("Die AAA besiegt die BBB und den CCC unverdient.")
for token in tagged:
    print(token.lemma_)

Die
Anonyme Affen Allianz
besiegen
der
Berliner Bauern Bund
und
der
Chaos Chaoten Club
unverdient
.

Upvotes: 5

Views: 2021

Answers (1)

gdaras

Reputation: 10149

Your solution seems fine.

However, a cleaner workaround would be to take advantage of spaCy's custom pipeline components. Specifically, you can create a new component that overrides the lemma attribute whenever a token's text appears in your custom lookup table, and then add it to the pipeline.

Example code:

import spacy

custom_lookup = {'AAA': 'Anonyme Affen Allianz', 'BBB': 'Berliner Bauern Bund', 'CCC': 'Chaos Chaoten Club'}

def change_lemma_property(doc):
    # Override the lemma for any token whose text is in the custom table
    for token in doc:
        if token.text in custom_lookup:
            token.lemma_ = custom_lookup[token.text]
    return doc

nlp = spacy.load('de')
nlp.add_pipe(change_lemma_property, first=True)  # add the component at the start of the pipeline

text = 'Die AAA besiegt die BBB und den CCC unverdient.'
doc = nlp(text)
for token in doc:
    print(token.lemma_)
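
As a side note: if you can upgrade to spaCy v2.2 or later, the lemma lookup tables are exposed through the `nlp.vocab.lookups` container (`spacy.lookups.Lookups`), which does provide set/update-style methods, so no monkey-patching of `spacy.lang.de.LOOKUP` is needed. A minimal sketch of that API, using a standalone `Lookups` object (no model download required) — the table name `"lemma_lookup"` is the one the v2.2+ lemmatizer reads:

```python
from spacy.lookups import Lookups

# Create a lookups container and add a table with some initial entries
lookups = Lookups()
table = lookups.add_table("lemma_lookup", {"AAA": "Anonyme Affen Allianz"})

# Entries can be added one by one; string keys are hashed internally
table.set("BBB", "Berliner Bauern Bund")
table.set("CCC", "Chaos Chaoten Club")

print(table.get("BBB"))
```

With a loaded model you would grab the existing table via `nlp.vocab.lookups.get_table("lemma_lookup")` and call `set` on it in the same way.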

Upvotes: 4
