Reputation: 327
I know similar questions have been asked:
Spacy custom sentence spliting
Custom sentence boundary detection in SpaCy
but my situation is a little different. I want to inherit from spaCy's Sentencizer with:
from spacy.pipeline import Sentencizer

class MySentencizer(Sentencizer):
    def __init__(self):
        self.tok = create_mySentencizer()  # returning the sentences

    def __call__(self, *args, **kwargs):
        doc = args[0]
        for tok in doc:
            # set the boundaries with tok.is_sent_start
            pass
        return doc
Even though splitting works fine when I call
doc = nlp("Text and so on. Another sentence.")
after updating the model:
nlp = spacy.load("some_model")
sentencizer = MySentencizer()
nlp.add_pipe(sentencizer, before="parser")
# update model
when I try to save the trained model with:
nlp.to_disk("path/to/my/model")
I get the following error:
AttributeError: 'MySentencizer' object has no attribute 'punct_chars'
In contrast, if I use nlp.add_pipe(nlp.create_pipe('sentencizer')), the error does not occur. At what point should I have set the punct_chars attribute? Shouldn't it have been inherited from the superclass?
If I drop Sentencizer as the base class and inherit from object instead, as in the first linked post, it works, but I may lose some valuable information along the way, e.g. punct_chars?
Thanks in advance for your help.
Chris
Upvotes: 1
Views: 686
Reputation: 25199
The following should work (note the call to super(MySentencizer, self).__init__(), which runs Sentencizer's own constructor so that punct_chars is set on the instance):
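import spacy
from spacy.pipeline import Sentencizer

class MySentencizer(Sentencizer):
    def __init__(self):
        # Run Sentencizer.__init__ so that punct_chars gets set.
        super(MySentencizer, self).__init__()

    def __call__(self, *args, **kwargs):
        doc = args[0]
        for i, tok in enumerate(doc):
            # tok.orth is an integer ID, so compare the text instead;
            # a sentence starts at the first token or right after a ".".
            tok.is_sent_start = i == 0 or doc[i - 1].text == "."
        return doc

nlp = spacy.load("en_core_web_md")
sentencizer = MySentencizer()
# spaCy v2-style call: add_pipe takes the component instance.
nlp.add_pipe(sentencizer, before="parser")
nlp.to_disk("model")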
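To see why the missing super().__init__() call produces that AttributeError, here is a minimal pure-Python sketch of the same pattern. BaseComponent, BrokenComponent, FixedComponent and try_serialize are hypothetical stand-ins, not spaCy classes or functions; the only assumption is that the base class assigns punct_chars in its __init__ and reads it back when serializing, which is what the error message in the question suggests.

```python
import json


class BaseComponent:
    """Hypothetical stand-in for a pipeline component such as Sentencizer."""

    def __init__(self, punct_chars=None):
        # The attribute only exists once this constructor has run.
        self.punct_chars = list(punct_chars) if punct_chars else [".", "!", "?"]

    def to_bytes(self):
        # Serialization reads the attribute that __init__ was meant to set.
        return json.dumps({"punct_chars": self.punct_chars}).encode("utf8")


class BrokenComponent(BaseComponent):
    def __init__(self):
        # Overrides __init__ without calling the parent: punct_chars is never set.
        self.extra = "custom state"


class FixedComponent(BaseComponent):
    def __init__(self):
        # Run the parent constructor first, then add custom state.
        super().__init__()
        self.extra = "custom state"


def try_serialize(component):
    """Return the serialized bytes, or the AttributeError message on failure."""
    try:
        return component.to_bytes()
    except AttributeError as err:
        return str(err)


broken_result = try_serialize(BrokenComponent())
fixed_result = try_serialize(FixedComponent())
print(broken_result)
print(fixed_result)
```

Methods such as to_bytes are inherited either way, but instance attributes created in the parent's __init__ exist only if that __init__ actually runs, so a subclass that overrides __init__ has to call it explicitly.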
Upvotes: 1