Reputation: 45
I have some problems with the nltk.sent_tokenize function.
My text (that I want to tokenize) consists of 54116 sentences that are separated by a dot; I removed all other punctuation. I would like to tokenize my text on the sentence level using nltk.sent_tokenize.
However, if I apply tokenized_text = sent_tokenize(mytext), the length of tokenized_text is only 51582 instead of 54116.
Any ideas why this could happen?
Kind regards
Upvotes: 0
Views: 138
Reputation: 159
This typically happens because the model for Sentence Boundary Detection cannot detect every boundary correctly; its accuracy is usually on the order of 97%-99%. That said, since you say the corpus has sentences strictly separated by a dot, you may simply split it on '.', provided there are no abbreviations like Prof., Dr. or Sr. You may like to refer to https://www.aclweb.org/anthology/C12-2096.pdf for further details.
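For illustration, a minimal sketch comparing the two approaches (the sample text and variable names are mine, not taken from your corpus; it assumes the punkt model is available via nltk.download):

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt")  # sentence boundary model used by sent_tokenize

    text = "First sentence. Second sentence. Prof. Smith wrote a third one."

    # Punkt-based detection: may merge sentences around tokens it treats
    # as abbreviations (e.g. "Prof."), which can lower the count.
    punkt_sentences = sent_tokenize(text)

    # Naive split on '.': correct only if every dot really ends a sentence.
    naive_sentences = [s.strip() for s in text.split(".") if s.strip()]

    print(len(punkt_sentences), len(naive_sentences))

If the two counts differ on your data, inspecting the sentences that sent_tokenize merges should reveal which tokens the model is treating as abbreviations.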
Upvotes: 1