ChrisDelClea

Reputation: 327

SpaCy save model to disk with custom Sentencizer error

I know similar questions were asked:

Spacy custom sentence spliting

Custom sentence boundary detection in SpaCy

My situation is a little different, though. I want to inherit from spaCy's Sentencizer like this:

from spacy.pipeline import Sentencizer

class MySentencizer(Sentencizer):
    def __init__(self):
        self.tok = create_mySentencizer() # returning the sentences

    def __call__(self, *args, **kwargs):
        doc = args[0]
        for tok in doc:
            pass  # set the sentence boundaries here via tok.is_sent_start
        return doc

Splitting works fine when I call doc = nlp("Text and so on. Another sentence.") after updating the model:

nlp = spacy.load("some_model")
sentencizer = MySentencizer()
nlp.add_pipe(sentencizer, before="parser")
# update model

However, when I try to save the trained model with:

nlp.to_disk("path/to/my/model")

I get the following error:

AttributeError: 'MySentencizer' object has no attribute 'punct_chars'

In contrast, if I use nlp.add_pipe(nlp.create_pipe('sentencizer')), the error does not occur. At what point should I have set the punct_chars attribute? Shouldn't it have been inherited from the superclass?

If I replace Sentencizer as the base class with object, following the first linked post, it works, but I may lose some useful information along the way, e.g. punct_chars.

Thanks in advance for your help.

Chris

Upvotes: 1

Views: 686

Answers (1)

Sergey Bushmanov

Reputation: 25199

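The following should work. Note the call to super(MySentencizer, self).__init__(): if you override __init__ without calling the parent constructor, punct_chars is never set on the instance, and saving the component needs that attribute: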

import spacy
from spacy.pipeline import Sentencizer

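class MySentencizer(Sentencizer):
    def __init__(self):
        # Run the parent constructor so that punct_chars gets set.
        super(MySentencizer, self).__init__()

    def __call__(self, *args, **kwargs):
        doc = args[0]
        for i, tok in enumerate(doc):
            # tok.orth is an integer hash, so compare the token text instead.
            # A token starts a sentence if it is first or follows a period.
            tok.is_sent_start = i == 0 or doc[i - 1].text == "."
        return doc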

nlp = spacy.load("en_core_web_md")
sentencizer = MySentencizer()
nlp.add_pipe(sentencizer, before="parser")

nlp.to_disk("model")
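To see why the super().__init__() call matters, here is a minimal sketch of the same failure in plain Python, without spaCy. BaseSentencizer, BrokenSentencizer, FixedSentencizer, to_json and try_serialize are hypothetical stand-ins made up for illustration; they are not spaCy's actual classes or methods, and the default punctuation set is an assumption. The point is only that an attribute assigned in the parent's __init__ does not exist on the instance unless that constructor actually runs.

```python
import json


class BaseSentencizer:
    """Hypothetical stand-in for a pipeline component (not spaCy's real class)."""

    def __init__(self, punct_chars=None):
        # The attribute only exists once this constructor has run.
        self.punct_chars = set(punct_chars or [".", "!", "?"])

    def to_json(self):
        # Serialization reads the attribute, much like saving to disk would.
        return json.dumps({"punct_chars": sorted(self.punct_chars)})


class BrokenSentencizer(BaseSentencizer):
    def __init__(self):
        # Overrides __init__ without calling the parent: punct_chars is never set.
        self.tok = None


class FixedSentencizer(BaseSentencizer):
    def __init__(self):
        # Calling the parent constructor first creates punct_chars.
        super().__init__()
        self.tok = None


def try_serialize(component):
    """Return the serialized string, or the AttributeError message on failure."""
    try:
        return component.to_json()
    except AttributeError as err:
        return "AttributeError: {}".format(err)


if __name__ == "__main__":
    print(try_serialize(BrokenSentencizer()))
    print(try_serialize(FixedSentencizer()))
```

Methods are inherited through the class, but instance attributes are created only when the code that assigns them runs. That is why the subclass in the question had a working to_disk() method yet no punct_chars to write out.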

Upvotes: 1
