Reputation: 8937
The Stanford NLP pipeline issues lots of warnings like the one below, which is particularly disturbing in a production setup:
WARN Untokenizable: � (U+FFFD, decimal: 65533)
Is there a way to disable them?
Upvotes: 0
Views: 561
Reputation: 9450
If you are working directly with a Tokenizer, the answer Denis Kulagin gives is good. If you are operating at the higher level of a StanfordCoreNLP pipeline, you can simply set the following property (or the equivalent command-line option):
tokenize.options = untokenizable=noneDelete
This silently deletes all untokenizable characters. To silently keep them instead, use:
tokenize.options = untokenizable=noneKeep
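If you build the pipeline from Java rather than from a properties file, the same setting can go into a java.util.Properties object. Below is a minimal sketch, not a definitive setup: the class name QuietTokenizerConfig, the build helper, and the annotator list "tokenize,ssplit" are placeholders I made up for illustration; only the tokenize.options key and the noneDelete / noneKeep values come from the answer above. The block itself uses only the JDK, so the pipeline construction is shown as a comment.

```java
import java.util.Properties;

public class QuietTokenizerConfig {

    /**
     * Builds pipeline properties that suppress "Untokenizable" warnings.
     * keepChars == true  -> untokenizable=noneKeep   (keep the characters)
     * keepChars == false -> untokenizable=noneDelete (drop the characters)
     */
    public static Properties build(boolean keepChars) {
        Properties props = new Properties();
        // Assumption: a minimal annotator list; substitute your own.
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.options",
                "untokenizable=" + (keepChars ? "noneKeep" : "noneDelete"));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build(false);
        // With CoreNLP on the classpath, pass these to the pipeline:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("tokenize.options"));
    }
}
```

For reference, the PTBTokenizer documentation lists six values for this option (noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep), combining whether a warning is logged for none, the first, or all untokenizable characters with whether those characters are deleted or kept.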
Upvotes: 2
Reputation: 8937
One can do it this way, by setting the option on the tokenizer factory:
import java.io.*;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.*;

Reader reader = new StringReader(paragraphText);
DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader, DocumentPreprocessor.DocType.Plain);
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
// noneDelete: drop untokenizable characters without logging a warning
factory.setOptions("untokenizable=noneDelete");
documentPreprocessor.setTokenizerFactory(factory);
From here: https://github.com/stanfordnlp/CoreNLP/issues/103#issuecomment-157793500
Upvotes: 0