Katharina Baur

Reputation: 45

number of tokenized sentences does not match number of sentences in text

I have some problems with the nltk.sent_tokenize function.

My text (that I want to tokenize) consists of 54116 sentences separated by a dot; I have removed all other punctuation.
I would like to tokenize my text at the sentence level using nltk.sent_tokenize.

However, if I apply tokenized_text = sent_tokenize(mytext), the length of tokenized_text is only 51582 instead of 54116.
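
For reference, this is roughly what I am doing (a minimal sketch; mytext holds the whole corpus as one string):

    from nltk.tokenize import sent_tokenize

    # mytext holds the whole corpus as one string; sentences are separated
    # by a dot and all other punctuation has been removed
    tokenized_text = sent_tokenize(mytext)

    print(len(tokenized_text))  # prints 51582, but the text contains 54116 sentences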

Any ideas why this could happen?

Kind regards

Upvotes: 0

Views: 138

Answers (1)

sks

Reputation: 159

This typically happens because the model used for sentence boundary detection cannot detect every boundary correctly; such models are usually limited to an accuracy on the order of 97%-99%. That said, since you say your corpus has sentences strictly separated by a "dot", you can simply split it on '.', provided there are no abbreviations such as Prof., Dr., or Sr. in the text. See https://www.aclweb.org/anthology/C12-2096.pdf for further details.
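
A minimal sketch of the comparison, assuming mytext is the corpus string from your question (sentences strictly separated by '.', no abbreviations left in the text):

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download('punkt')  # Punkt model used by sent_tokenize (downloads once)

    punkt_sentences = sent_tokenize(mytext)  # statistical boundary detection
    naive_sentences = [s.strip() for s in mytext.split('.') if s.strip()]

    # The naive split should recover all 54116 sentences, while the Punkt
    # model may merge some of them and report fewer (51582 in your case).
    print(len(punkt_sentences), len(naive_sentences))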

Upvotes: 1
