What would be the best way to split unpunctuated text into sentences in Java?
The text may contain several sentences with no punctuation between them, e.g.:
String text = "i ate cornflakes it is a sunny day i have to wash my car";
String[] sentences = splitOnSentenceLevel(text);
System.out.print(Arrays.toString(sentences));
>>>["i ate cornflakes", "it is a sunny day", "i have to wash my car"]
The only approach I could find is to train an n-gram model on punctuated text so that it estimates, for each position, the probability that a sentence ends there. But setting that up seems like a huge task.
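import java.util.ArrayList;
import java.util.List;

public String[] splitOnSentenceLevel(String text) {
    List<String> sentences = new ArrayList<>();
    StringBuilder currentSentence = new StringBuilder();
    for (String word : text.trim().split("\\s+")) {
        // Join words with a single space and no leading space.
        if (currentSentence.length() > 0) {
            currentSentence.append(' ');
        }
        currentSentence.append(word);
        // Placeholder for the n-gram classifier that would still have to be trained.
        if (nGramClassifierIsLastWordOfSentence(word)) {
            sentences.add(currentSentence.toString());
            currentSentence.setLength(0);
        }
    }
    // Keep any trailing words the classifier did not close off.
    if (currentSentence.length() > 0) {
        sentences.add(currentSentence.toString());
    }
    return sentences.toArray(new String[0]);
}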
The Stanford CoreNLP toolkit doesn't seem to offer such a feature either. I realize the task is inherently ambiguous, but is there a simpler way to at least approximate a solution? The text I want to analyze consists of relatively simple, short sentences.
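For reference, the n-gram idea mentioned above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution: the class name `BigramSentenceSplitter`, the add-k smoothing constant, and the greedy per-gap decision rule are all assumptions made for the sketch. It counts word bigrams (with sentence start and end markers) in punctuated training text and then, at each gap between two words, compares "the sentence ends here and a new one starts" against "the sentence continues".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a tiny bigram model that guesses sentence boundaries. */
final class BigramSentenceSplitter {
    private static final String START = "<s>";
    private static final String END = "</s>";
    private static final double K = 0.01; // add-k smoothing constant (arbitrary choice)

    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Trains on punctuated text; '.', '!' and '?' are assumed to end sentences. */
    static BigramSentenceSplitter train(String punctuatedText) {
        BigramSentenceSplitter model = new BigramSentenceSplitter();
        for (String sentence : punctuatedText.toLowerCase().split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String previous = START;
            for (String word : trimmed.split("\\s+")) {
                model.count(previous, word);
                previous = word;
            }
            model.count(previous, END);
        }
        return model;
    }

    /** Records one occurrence of the bigram (previous, next). */
    private void count(String previous, String next) {
        bigramCounts.computeIfAbsent(previous, key -> new HashMap<>())
                .merge(next, 1, Integer::sum);
        contextCounts.merge(previous, 1, Integer::sum);
        vocabulary.add(next);
    }

    /** Add-k smoothed estimate of P(next | previous). */
    private double probability(String previous, String next) {
        Map<String, Integer> followers = bigramCounts.get(previous);
        int pair = followers == null ? 0 : followers.getOrDefault(next, 0);
        int context = contextCounts.getOrDefault(previous, 0);
        return (pair + K) / (context + K * Math.max(1, vocabulary.size()));
    }

    /** Greedily splits unpunctuated text, deciding at each gap between two words. */
    List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return sentences;
        }
        String[] words = trimmed.split("\\s+");
        StringBuilder current = new StringBuilder(words[0]);
        for (int i = 1; i < words.length; i++) {
            String previous = words[i - 1];
            String next = words[i];
            // Hypothesis 1: previous ends a sentence and next starts a new one.
            double breakHere = probability(previous, END) * probability(START, next);
            // Hypothesis 2: next simply follows previous within the same sentence.
            double carryOn = probability(previous, next);
            if (breakHere > carryOn) {
                sentences.add(current.toString());
                current.setLength(0);
            } else {
                current.append(' ');
            }
            current.append(next);
        }
        sentences.add(current.toString());
        return sentences;
    }
}
```

Trained via `BigramSentenceSplitter.train(...)` on a handful of punctuated sentences that cover the same word pairs (for example "I ate toast. It is a rainy day. I have to wash my bike. She ate cornflakes. It was a sunny day. We have to sell my car."), `split` recovers the three sentences from the example text. On real data it would need a much larger corpus, and a model with more context than bigrams would probably do better.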