Reputation: 45
I have some problems with the nltk.sent_tokenize function.
My text (that I want to tokenize) consists of 54116 sentences that are separated by a dot; I removed all other punctuation. I would like to tokenize my text on the sentence level using nltk.sent_tokenize.
However, if I apply tokenized_text = sent_tokenize(mytext), the length of tokenized_text is only 51582 instead of 54116.
Any ideas why this could happen?
Kind regards
Upvotes: 0
Views: 138
Reputation: 159
This typically happens because the model for Sentence Boundary Detection cannot detect every boundary correctly; its accuracy is usually on the order of 97%-99%. That said, since you say the corpus has sentences strictly separated by a dot, you may simply split it on '.', provided there are no abbreviations like Prof., Dr. or Sr. You may like to refer to https://www.aclweb.org/anthology/C12-2096.pdf for further details.
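For illustration, a minimal sketch comparing the two approaches (the sample text and variable names are mine, not taken from your corpus; it assumes the punkt model is available via nltk.download):

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt")  # sentence boundary model used by sent_tokenize

    text = "First sentence. Second sentence. Prof. Smith wrote a third one."

    # Punkt-based detection: may merge sentences around tokens it treats
    # as abbreviations (e.g. "Prof."), which can lower the count.
    punkt_sentences = sent_tokenize(text)

    # Naive split on '.': correct only if every dot really ends a sentence.
    naive_sentences = [s.strip() for s in text.split(".") if s.strip()]

    print(len(punkt_sentences), len(naive_sentences))

If the two counts differ on your data, inspecting the sentences that sent_tokenize merges should reveal which tokens the model is treating as abbreviations.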
Upvotes: 1