Reputation: 383
I am using spaCy's sentencizer to split the sentences.
from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)
text="Please read the analysis. (You'll be amazed.)"
doc = nlp(text)
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)
print([token.text for token in doc])
OUTPUT
['Please read the analysis. (',
"You'll be amazed.)"]
['Please', 'read', 'the', 'analysis', '.', '(', 'You', "'ll", 'be',
'amazed', '.', ')']
Tokenization is done correctly, but I am not sure why the sentencizer splits off the second sentence without the (, attaching it to the end of the first sentence instead.
Upvotes: 8
Views: 19822
Reputation: 383
I have tested the code below with both the en_core_web_lg and en_core_web_sm models; with the sm model the speed is similar to using the sentencizer (the lg model hurts performance).
The custom boundaries below only work with the sm model and produce different splits with the lg model.
import spacy

nlp = spacy.load('en_core_web_sm')

def set_custom_boundaries(doc):
    # Force or suppress sentence starts around the parenthesis tokens
    for token in doc[:-1]:
        if token.text == ".(" or token.text == ").":
            doc[token.i + 1].is_sent_start = True
        elif token.text == "Rs." or token.text == ")":
            doc[token.i + 1].is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")

text = "Please read the analysis. (You'll be amazed.)"
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
Upvotes: 4
Reputation: 11474
The sentencizer is a very fast but also very minimal sentence splitter that's not going to have good performance with punctuation like this. It's good for splitting texts into sentence-ish chunks, but if you need higher quality sentence segmentation, use the parser component of an English model instead.
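As a rough sketch (assuming the en_core_web_sm model is installed), loading a full pipeline and keeping the parser enabled is enough to get parser-based sentence boundaries:

import spacy

# Load a full English pipeline; the parser sets sentence boundaries,
# so doc.sents no longer relies on the rule-based sentencizer.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Please read the analysis. (You'll be amazed.)")
for sent in doc.sents:
    print(sent.text)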
Upvotes: 1