Danila Zharenkov

Reputation: 1863

Stanford coreNLP splitting paragraph sentences without whitespace

I've run into a problem with Stanford's sentence annotator. My input is text that contains sentences, but in some places there is no whitespace after the period. Like this:

Dog loves cat.Cat loves mouse. Mouse hates everybody.

So when I try to use the sentence annotator, I get two sentences:

Dog loves cat.Cat loves mouse.

Mouse hates everybody.

Here is my code:

import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
Annotation doc = new Annotation(t);
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(doc);
List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);

I also tried adding the property

props.setProperty("ssplit.boundaryTokenRegex", "\\.");

but it had no effect.

Am I missing something? Thanks!

UPD: I also tried to tokenize the text with PTBTokenizer:

import java.io.FileReader;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordTokenFactory;
PTBTokenizer<Word> ptbTokenizer = new PTBTokenizer<>(
        new FileReader(classLoader.getResource("simplifiedParagraphs.txt").getFile()),
        new WordTokenFactory(),
        "untokenizable=allKeep,tokenizeNLs=true,ptb3Escaping=true,strictTreebank3=true,unicodeEllipsis=true");
List<Word> tokens = ptbTokenizer.tokenize();

but the tokenizer treats cat.Cat as a single token and doesn't split it.

Upvotes: 0

Views: 1008

Answers (1)

aab

Reputation: 11494

In this pipeline, the sentence splitter identifies sentence boundaries among the tokens produced by the tokenizer, but it only groups adjacent tokens into sentences; it never merges or splits the tokens themselves.
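To see this concretely, here is a minimal sketch (using the example string from your question) that runs only the tokenizer and prints the token stream the sentence splitter would receive; "cat.Cat" should come out as a single token, so the splitter never sees a standalone "." to use as a boundary:

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
props.setProperty("annotators", "tokenize");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation doc = new Annotation("Dog loves cat.Cat loves mouse. Mouse hates everybody.");
pipeline.annotate(doc);
// Print each token; "cat.Cat" appears as one token here.
for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.println(token.word());
}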

As you found, the ssplit.boundaryTokenRegex property tells the sentence splitter to end a sentence when it sees "." as a token, but it doesn't help in cases where the tokenizer hasn't split the "." apart from the surrounding text into a separate token.

You will need to either:

  • preprocess your text (insert a space after "cat."; see the sketch after this list),
  • postprocess your tokens or sentences to split cases like this, or
  • find/develop a tokenizer that can split "cat.Cat" into three tokens.
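For the first option, here is a rough preprocessing sketch: insert a space after any "." that is directly followed by an uppercase letter. This heuristic is my own assumption, not a general fix; it will wrongly split abbreviations like "U.S.A", so adjust the regex to your data:

// Heuristic: add a space after "." whenever an uppercase letter follows,
// so the tokenizer can see "." as its own token.
// Caution: this mis-splits abbreviations such as "U.S.A".
String fixed = t.replaceAll("\\.(?=\\p{Lu})", ". ");
// "Dog loves cat.Cat loves mouse." -> "Dog loves cat. Cat loves mouse."

After this replacement, the default tokenizer emits "." as a separate token and ssplit should find all three sentences.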

None of the standard English tokenizers, which are typically intended for newspaper-style text, have been developed to handle this kind of input.

Some related questions:

Does the NLTK sentence tokenizer assume correct punctuation and spacing?

How to split text into sentences when there is no space after full stop?

Upvotes: 2
