toobee

Reputation: 2752

Using SpaCy's Lemmatizer as a stand-alone component

I want to use SpaCy's lemmatizer as a stand-alone component, because I have pre-tokenized text and I don't want to re-concatenate it and run the full pipeline (SpaCy would most likely tokenize it differently in some cases).

I found the lemmatizer in the package, but I somehow need to load the dictionaries with the rules to initialize this Lemmatizer. These files must be somewhere in the English or German model, right? I couldn't find them there.

    from spacy.lemmatizer import Lemmatizer
    lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

Where do the LEMMA_INDEX, etc. files come from?

I found a similar question here: Spacy lemmatizer issue/consistency, but it does not fully answer how to get these dictionary files from the model. The spacy.lang.* parameter seems to no longer exist in newer versions.

Upvotes: 0

Views: 489

Answers (1)

bivouac0

Reputation: 2570

Here's an extracted bit of code I had that uses the SpaCy lemmatizer by itself. I'm not somewhere I can run it, so it might have a small bug or two if I made an editing mistake.

Note that in general you need to know the upos (universal part-of-speech tag) for a word in order to lemmatize it correctly. This code returns all the possible lemmas, but I would advise modifying it to pass in the correct upos for your word.

class SpacyLemmatizer(object):
    def __init__(self, smodel):
        import spacy
        # Reach into the loaded model's vocab for the rule-based lemmatizer
        # (spaCy 2.x attribute path; it was removed in spaCy 3)
        self.lemmatizer = spacy.load(smodel).vocab.morphology.lemmatizer

    # get the lemmas for every upos
    def getLemmas(self, entry):
        possible_lemmas = set()
        for upos in ('NOUN', 'VERB', 'ADJ', 'ADV'):
            lemmas = self.lemmatizer(entry, upos, morphology=None)
            lemma = lemmas[0]    # See morphology.pyx::lemmatize
            possible_lemmas.add(lemma)
        return possible_lemmas
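As a rough illustration of what the LEMMA_INDEX / LEMMA_EXC / LEMMA_RULES tables hold and how the rule-based lemmatizer uses them, here is a toy sketch of the lookup-then-suffix-rule strategy: exceptions first, then suffix rewrite rules validated against an index of known lemmas. The tables below are made-up miniatures for illustration, not the real English data shipped with the models.

```python
# Toy stand-ins for the real tables (irregulars, suffix rules, known lemmas).
LEMMA_INDEX = {"noun": {"wolf", "city"}, "verb": {"run", "stop"}}
LEMMA_EXC = {"noun": {"wolves": ["wolf"]}, "verb": {"ran": ["run"]}}
LEMMA_RULES = {"noun": [["ies", "y"], ["s", ""]],
               "verb": [["ped", ""], ["ing", ""]]}

def lemmatize(string, pos):
    # 1. Irregular forms come straight from the exception table.
    if string in LEMMA_EXC.get(pos, {}):
        return LEMMA_EXC[pos][string][0]
    # 2. Otherwise try each suffix rule and keep candidates that
    #    appear in the index of known lemmas for this part of speech.
    forms = []
    for old, new in LEMMA_RULES.get(pos, []):
        if string.endswith(old):
            form = string[: len(string) - len(old)] + new
            if form in LEMMA_INDEX.get(pos, set()):
                forms.append(form)
    # 3. Fall back to the surface form if nothing matched.
    return forms[0] if forms else string
```

So `lemmatize("wolves", "noun")` hits the exception table, while `lemmatize("cities", "noun")` goes through the "ies" -> "y" rule and is accepted because "city" is in the index. This is why the answer above passes a upos to the lemmatizer: the tables are keyed by part of speech.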

Upvotes: 3
