Marzi Heidari

Reputation: 2730

spaCy 3.0: can't add custom lookups for Lemmatizer

I used the code below to add custom lookups to a custom Language class:

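import json
import os
from typing import Optional

from spacy.language import Language
from spacy.lookups import Lookups
from spacy.pipeline import Lemmatizer
from thinc.api import Model


def create_lookups():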
    lookups = Lookups()
    lookups.add_table("lemma_lookup", LOOKUP)
    lookups.add_table("lemma_rules", json_to_dict('lemma_rules.json'))
    lookups.add_table("lemma_index", json_to_dict('lemma_index.json'))
    lookups.add_table("lemma_exc", json_to_dict('lemma_exc.json'))
    return lookups


def json_to_dict(filename):
    location = os.path.realpath(
        os.path.join(os.getcwd(), os.path.dirname(__file__)))
    with open(os.path.join(location, filename)) as f_in:
        return json.load(f_in)


@CustomLanguage.factory(
    "lemmatizer",
    assigns=["token.lemma"],
    default_config={"model": None, "mode": "lookup", "overwrite": False},
    default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
        nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
):
    lemmatizer = Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
    lemmatizer.lookups = create_lookups()
    return lemmatizer


But when I instantiate the CustomLanguage there is no lookup table in nlp.vocab.lookups. What is the problem and how can I solve it?

Upvotes: 0

Views: 404

Answers (1)

aab

Reputation: 11494

The lemmatizer lookups are no longer in the vocab. They're stored in the lemmatizer component under nlp.get_pipe("lemmatizer").lookups instead.
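To see where the tables end up, here is a minimal sketch. It assumes spaCy 3.x, uses the built-in English lemmatizer in "lookup" mode rather than the custom language from the question, and fills the table with two made-up entries purely for illustration:

```python
import spacy
from spacy.lookups import Lookups

# Assumption: a blank English pipeline stands in for the custom language.
nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Illustrative table contents only; "lookup" mode requires "lemma_lookup".
lookups = Lookups()
lookups.add_table("lemma_lookup", {"mice": "mouse", "ran": "run"})
lemmatizer.initialize(lookups=lookups)

# The table is attached to the component, not to the shared vocab.
component_tables = nlp.get_pipe("lemmatizer").lookups.tables
in_vocab = nlp.vocab.lookups.has_table("lemma_lookup")

lemmas = [token.lemma_ for token in nlp("mice ran")]
print(component_tables, in_vocab, lemmas)
```

So checking nlp.vocab.lookups for lemma tables will come up empty even when the lemmatizer itself is working.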

If your lemmatizer factory creates the lemmatizer like this, anyone loading the model will need to have these JSON files available, or the model won't load. (The lookup tables are saved with the model, but your make_lemmatizer function wasn't written with this in mind: it reloads them from the JSON files every time the component is created.)

Instead, create a custom lemmatizer class that loads these tables in its initialize method. Adding the lemmatizer and loading its tables once would then look like this:

import spacy

nlp = spacy.blank("lg")
nlp.add_pipe("lemmatizer").initialize()
nlp.to_disk("/path/to/model")

After you've run initialize() once for the lemmatizer, the tables are saved in the model directory, so you don't need to run it again when you reload the model.
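A rough way to check this round trip, again assuming spaCy 3.x and the built-in English "lookup" lemmatizer with a one-entry illustrative table, written to a temporary directory instead of a real model path:

```python
import tempfile

import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Illustrative table; initialize() is called exactly once, before saving.
lookups = Lookups()
lookups.add_table("lemma_lookup", {"mice": "mouse"})
lemmatizer.initialize(lookups=lookups)

with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)
    # No initialize() call here: the tables are read back from disk.
    reloaded = spacy.load(model_dir)

reloaded_tables = reloaded.get_pipe("lemmatizer").lookups.tables
reloaded_lemmas = [token.lemma_ for token in reloaded("mice")]
print(reloaded_tables, reloaded_lemmas)
```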

It could look something like this, which would also allow you to pass in a Lookups object to initialize instead if you'd prefer:

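from typing import Callable, Iterable, Optional

from spacy.language import Language
from spacy.lookups import Lookups
from spacy.pipeline import Lemmatizer
from spacy.training import Example


class CustomLemmatizer(Lemmatizer):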
    def initialize(
        self,
        get_examples: Optional[Callable[[], Iterable[Example]]] = None,
        *,
        nlp: Optional[Language] = None,
        lookups: Optional[Lookups] = None,
    ):
        if lookups is None:
            self.lookups = create_lookups()
        else:
            self.lookups = lookups
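Putting the pieces together might look roughly like the sketch below. Several things here are assumptions for illustration: the factory name "custom_lemmatizer", registering it on the base Language class instead of a custom one, the blank "en" pipeline, and a create_lookups() that returns a tiny in-memory table in place of the JSON files from the question.

```python
from typing import Optional

import spacy
from spacy.language import Language
from spacy.lookups import Lookups
from spacy.pipeline import Lemmatizer
from thinc.api import Model


def create_lookups() -> Lookups:
    # Stand-in for the JSON-backed helper in the question.
    lookups = Lookups()
    lookups.add_table("lemma_lookup", {"mice": "mouse"})
    return lookups


class CustomLemmatizer(Lemmatizer):
    def initialize(self, get_examples=None, *, nlp=None, lookups=None):
        # Load the tables once here rather than in the factory.
        self.lookups = create_lookups() if lookups is None else lookups


# Hypothetical factory name; the factory only constructs the component.
@Language.factory(
    "custom_lemmatizer",
    assigns=["token.lemma"],
    default_config={"model": None, "mode": "lookup", "overwrite": False},
    default_score_weights={"lemma_acc": 1.0},
)
def make_custom_lemmatizer(
    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
):
    return CustomLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)


nlp = spacy.blank("en")
nlp.add_pipe("custom_lemmatizer").initialize()
default_lemmas = [token.lemma_ for token in nlp("mice")]

# Alternatively, pass a Lookups object explicitly.
other = Lookups()
other.add_table("lemma_lookup", {"geese": "goose"})
nlp.get_pipe("custom_lemmatizer").initialize(lookups=other)
passed_lemmas = [token.lemma_ for token in nlp("geese")]
print(default_lemmas, passed_lemmas)
```

Because the factory no longer touches the JSON files, loading a saved model only needs the factory to be registered (for example by importing the module that defines it), not the original data files.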

Upvotes: 3
