Reputation: 1863
I ran into a problem with Stanford's sentence annotator. My input is text that contains several sentences, but in some places there is no whitespace after the period. Like this:
Dog loves cat.Cat loves mouse. Mouse hates everybody.
So when I use SentenceAnnotator, I get 2 sentences:
Dog loves cat.Cat loves mouse.
Mouse hates everybody.
Here is my code:
Annotation doc = new Annotation(t);
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(doc);
List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
I also tried adding the property
props.setProperty("ssplit.boundaryTokenRegex", "\\.");
but it had no effect.
Maybe I'm missing something? Thanks!
UPD: I also tried to tokenize the text using PTBTokenizer:
// WordTokenFactory produces Word objects, so the tokenizer is typed as PTBTokenizer<Word>
PTBTokenizer<Word> ptbTokenizer = new PTBTokenizer<>(
new FileReader(classLoader.getResource("simplifiedParagraphs.txt").getFile())
,new WordTokenFactory()
,"untokenizable=allKeep,tokenizeNLs=true,ptb3Escaping=true,strictTreebank3=true,unicodeEllipsis=true");
List<Word> words = ptbTokenizer.tokenize();
but the tokenizer treats cat.Cat as a single word and doesn't split it.
Upvotes: 0
Views: 1008
Reputation: 11494
This is a pipeline in which the sentence splitter identifies sentence boundaries among the tokens provided by the tokenizer. The sentence splitter only groups adjacent tokens into sentences; it doesn't try to merge or split them.
I think the ssplit.boundaryTokenRegex property would tell the sentence splitter to end a sentence when it sees "." as a token, but, as you found, this doesn't help in cases where the tokenizer hasn't split the "." from the surrounding text into a separate token.
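To make the division of labour concrete, here is a toy model of such a pipeline in plain Java. It is not CoreNLP code: the class ToySentencePipeline and both methods are hypothetical, and the tokenizer rule (detach a period only at the end of a whitespace-separated chunk) is a deliberate simplification chosen to mimic the behaviour described in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; none of this is part of Stanford CoreNLP.
final class ToySentencePipeline {

    // Toy tokenizer: split on whitespace and detach a period only when it
    // is the last character of a chunk. "cat.Cat" therefore stays whole.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty input produces one empty chunk
            }
            if (chunk.length() > 1 && chunk.endsWith(".")) {
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(".");
            } else {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    // Toy splitter: groups adjacent tokens and closes a sentence whenever
    // a token is exactly ".". It never looks inside a token.
    static List<List<String>> splitSentences(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (token.equals(".")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dog loves cat.Cat loves mouse. Mouse hates everybody.";
        for (List<String> sentence : splitSentences(tokenize(text))) {
            System.out.println(String.join(" ", sentence));
        }
    }
}
```

Because the period inside "cat.Cat" never becomes a token of its own, the splitter has no boundary to act on there, and the sample text comes out as two sentences rather than three.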
You will need to either:
None of the standard English tokenizers, which are typically intended to be used with newspaper text, have been developed to handle this kind of text.
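One possible workaround is to normalize the text before it reaches the tokenizer. The sketch below is an assumption on my part, not a CoreNLP feature: the class PeriodSpacingFixer and its method are hypothetical, and the rule (a period directly between a lowercase letter and an uppercase letter marks a missing space) is a heuristic that will misfire on strings such as mixed-case file names or identifiers.

```java
import java.util.regex.Pattern;

// Hypothetical preprocessing helper; not part of Stanford CoreNLP.
final class PeriodSpacingFixer {

    // A period preceded by a lowercase letter and immediately followed by
    // an uppercase letter, e.g. the one in "cat.Cat".
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?<=\\p{Ll})\\.(?=\\p{Lu})");

    // Inserts a space after each such period; other periods are untouched.
    static String insertMissingSpaces(String text) {
        return MISSING_SPACE.matcher(text).replaceAll(". ");
    }

    public static void main(String[] args) {
        String raw = "Dog loves cat.Cat loves mouse. Mouse hates everybody.";
        // Prints: Dog loves cat. Cat loves mouse. Mouse hates everybody.
        System.out.println(insertMissingSpaces(raw));
    }
}
```

The cleaned string can then be passed to new Annotation(...) as in the question, so the tokenizer sees ordinary spacing. Abbreviations such as "U.S.A" are left alone because the letter before each period is uppercase, but it is worth checking the heuristic against a sample of your own data before relying on it.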
Some related questions:
Does the NLTK sentence tokenizer assume correct punctuation and spacing?
How to split text into sentences when there is no space after full stop?
Upvotes: 2