Reputation: 383
I am using spaCy's sentencizer to split the sentences.
from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)
text="Please read the analysis. (You'll be amazed.)"
doc = nlp(text)
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)
print([token.text for token in doc])
OUTPUT
['Please read the analysis. (',
"You'll be amazed.)"]
['Please', 'read', 'the', 'analysis', '.', '(', 'You', "'ll", 'be',
'amazed', '.', ')']
Tokenization is done correctly, but I am not sure why the sentencizer splits off the second sentence without the (, attaching it to the end of the first sentence instead.
Upvotes: 8
Views: 19822
Reputation: 383
I have tested the code below with both the en_core_web_lg and en_core_web_sm models; with the sm model the speed is similar to using the sentencizer (the lg model hurts performance).
The custom boundaries below only work with the sm model and produce different splits with the lg model.
import spacy

nlp = spacy.load('en_core_web_sm')

def set_custom_boundaries(doc):
    # Force or suppress sentence starts around the parenthesis tokens
    for token in doc[:-1]:
        if token.text == ".(" or token.text == ").":
            doc[token.i + 1].is_sent_start = True
        elif token.text == "Rs." or token.text == ")":
            doc[token.i + 1].is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")

text = "Please read the analysis. (You'll be amazed.)"
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
Upvotes: 4
Reputation: 11474
The sentencizer is a very fast but also very minimal sentence splitter that's not going to have good performance with punctuation like this. It's good for splitting texts into sentence-ish chunks, but if you need higher quality sentence segmentation, use the parser component of an English model instead.
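As a rough sketch (assuming the en_core_web_sm model is installed), loading a full pipeline and keeping the parser enabled is enough to get parser-based sentence boundaries:

import spacy

# Load a full English pipeline; the parser sets sentence boundaries,
# so doc.sents no longer relies on the rule-based sentencizer.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Please read the analysis. (You'll be amazed.)")
for sent in doc.sents:
    print(sent.text)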
Upvotes: 1