bear

Reputation: 663

Splitting raw text on sentence level

What would be the best way to split text that has no punctuation into sentences in Java?

The text may contain multiple sentences without punctuation, e.g.:

String text = "i ate cornflakes it is a sunny day i have to wash my car";
String[] sentences = splitOnSentenceLevel(text);
System.out.print(Arrays.toString(sentences));
// desired output: [i ate cornflakes, it is a sunny day, i have to wash my car]

The only solution I could find is to train an n-gram model on punctuated text so that it estimates, for each position, the probability that a sentence ends there. But setting that up seems like a huge task.

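import java.util.ArrayList;
import java.util.List;

public String[] splitOnSentenceLevel(String text) {
    List<String> sentences = new ArrayList<>();
    StringBuilder currentSentence = new StringBuilder();
    for (String word : text.trim().split("\\s+")) {
        if (currentSentence.length() > 0) {
            currentSentence.append(' ');
        }
        currentSentence.append(word);
        // nGramClassifierIsLastWordOfSentence is a placeholder for the trained model
        if (nGramClassifierIsLastWordOfSentence(word)) {
            sentences.add(currentSentence.toString());
            currentSentence.setLength(0);
        }
    }
    // Keep any trailing words the classifier did not close off
    if (currentSentence.length() > 0) {
        sentences.add(currentSentence.toString());
    }
    return sentences.toArray(new String[0]);
}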
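
To make the idea more concrete, here is a rough sketch of a much cruder stand-in for the n-gram classifier. It is not a real n-gram model: it only counts, in punctuated training text, how often each word starts or ends a sentence, and then splits wherever the product of "previous word ends a sentence" and "next word starts a sentence" exceeds a threshold. The class name `BoundarySketch`, its methods, and the threshold are all made up for illustration, and words never seen in training are assumed never to mark a boundary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: per-word start/end frequencies, not a full n-gram model.
class BoundarySketch {
    private final Map<String, Integer> total = new HashMap<>();
    private final Map<String, Integer> starts = new HashMap<>();
    private final Map<String, Integer> ends = new HashMap<>();

    /** Counts how often each word occurs, starts a sentence, and ends one. */
    public void train(String punctuatedText) {
        for (String sentence : punctuatedText.toLowerCase().split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] words = trimmed.split("\\s+");
            for (String w : words) {
                total.merge(w, 1, Integer::sum);
            }
            starts.merge(words[0], 1, Integer::sum);
            ends.merge(words[words.length - 1], 1, Integer::sum);
        }
    }

    /** Relative frequency part/whole for a word; 0 for unseen words (assumption). */
    private static double ratio(Map<String, Integer> part,
                                Map<String, Integer> whole, String w) {
        Integer n = whole.get(w);
        return n == null ? 0.0 : part.getOrDefault(w, 0) / (double) n;
    }

    /** Splits before words[i] when the boundary score exceeds the threshold. */
    public List<String> split(String text, double threshold) {
        List<String> sentences = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return sentences;
        }
        String[] words = trimmed.split("\\s+");
        StringBuilder current = new StringBuilder(words[0]);
        for (int i = 1; i < words.length; i++) {
            double score = ratio(ends, total, words[i - 1])
                         * ratio(starts, total, words[i]);
            if (score > threshold) {
                sentences.add(current.toString());
                current = new StringBuilder(words[i]);
            } else {
                current.append(' ').append(words[i]);
            }
        }
        sentences.add(current.toString());
        return sentences;
    }
}
```

Trained on a handful of short punctuated sentences and called as `model.split(text, 0.5)`, this only finds boundaries between words it has already seen at sentence edges, so it would need far more training text (and probably real n-gram context) to be useful.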

The Stanford CoreNLP toolkit doesn't seem to have such a feature either. The task is obviously ambiguous, but is there a simpler way of at least approximating a solution? The text I would like to analyze would contain relatively simple, short sentences.

Upvotes: 2

Views: 378

Answers (0)
