Reputation: 191
Assume I have an array of words, e.g. {"I", "like", "melons", "Susan", "likes", "apples"} (only a very simple example). I want to find where I should add a period, i.e. where one sentence ends and the next begins. So the answer would be "I like melons." "Susan likes apples."
Capitalization could give some hints, but a capitalized word is not guaranteed to be a start word (the first word of a sentence). For example, abbreviations like NBA and USA, and country names like America and Canada, are capitalized but can appear in the middle of a sentence.
What algorithm can be used to do the work?
Upvotes: 0
Views: 123
Reputation: 657
Without building a classifier and training it on a large corpus, I think looking for a period followed by a capitalized word is the only simple approach. You can also find long lists of capitalized abbreviations like those (and potentially proper nouns as well), which could help you rule out capitalized words that are not sentence starts.
NLTK has some good tools for this, such as its Punkt sentence tokenizer, which I believe uses a combination of those approaches and gets very good precision.
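To make the capitalization-plus-exception-list idea concrete for the unpunctuated word array in the question, here is a minimal sketch. The exception set `KNOWN_MID_SENTENCE` and the function `split_sentences` are hypothetical names made up for illustration, and the tiny word list stands in for the much longer lists of abbreviations and proper nouns you would need in practice. It is a naive heuristic, not how NLTK does it.

```python
# Hypothetical exception list: capitalized words that may appear mid-sentence.
# A real list of abbreviations and proper nouns would be far longer.
KNOWN_MID_SENTENCE = {"I", "NBA", "USA", "America", "Canada"}


def split_sentences(words, exceptions=KNOWN_MID_SENTENCE):
    """Group words into sentences, starting a new sentence at every
    capitalized word that is not in the exception list."""
    sentences = []
    current = []
    for word in words:
        # A capitalized word outside the exception list is treated as a start word.
        starts_new = word[:1].isupper() and word not in exceptions
        if starts_new and current:
            sentences.append(current)
            current = []
        current.append(word)
    if current:
        sentences.append(current)
    # Join each group and add the period the question asks for.
    return [" ".join(s) + "." for s in sentences]


words = ["I", "like", "melons", "Susan", "likes", "apples"]
print(split_sentences(words))
```

Note the obvious weakness: any sentence after the first that begins with a word from the exception list (for example a second sentence starting with "I" or "Canada") will not be split off, and a name like "Susan" in the middle of a sentence will cause a false split. That is the kind of ambiguity a trained classifier is meant to resolve.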
Upvotes: 1