Denis Kulagin

Reputation: 8937

Stanford NLP: how to disable warnings?

The Stanford NLP pipeline emits a lot of warnings like the one below, which is particularly disruptive in a production setup:

WARN  Untokenizable: � (U+FFFD, decimal: 65533)

Is there a way to disable them?

Upvotes: 0

Views: 561

Answers (2)

Christopher Manning

Reputation: 9450

If you are working directly with a Tokenizer, the answer Denis Kulagin gives is good; if you are operating at the higher level of a StanfordCoreNLP pipeline, you can simply set the following property (or the equivalent command-line option):

tokenize.options = untokenizable=noneDelete

(to silently delete all unknown characters) or to silently keep them:

tokenize.options = untokenizable=noneKeep
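
For anyone setting this from code rather than a properties file, here is a minimal sketch of building the same property with java.util.Properties. The class name QuietTokenizerConfig and the annotator list "tokenize,ssplit" are assumptions made for illustration; only the tokenize.options key and its untokenizable values come from the answer above. Constructing the pipeline itself needs CoreNLP on the classpath, so that step is only indicated in a comment.

```java
import java.util.Properties;

// Hypothetical helper class; the name is illustrative, not part of CoreNLP.
public class QuietTokenizerConfig {

    // Builds pipeline properties with the given untokenizable policy,
    // e.g. "noneDelete" (drop silently) or "noneKeep" (keep silently).
    public static Properties build(String policy) {
        Properties props = new Properties();
        // Assumed annotator list; replace with whatever the pipeline actually runs.
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.options", "untokenizable=" + policy);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("noneDelete");
        // With CoreNLP available, these properties would be handed to the pipeline:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("tokenize.options"));
    }
}
```

Swapping "noneDelete" for "noneKeep" in the call to build gives the second variant, where unknown characters are kept as tokens without a warning.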

Upvotes: 2

Denis Kulagin

Reputation: 8937

When using a tokenizer directly, you can set the option on the tokenizer factory:

// paragraphText is the input String to be split into sentences
Reader reader = new StringReader(paragraphText);
DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader, DocumentPreprocessor.DocType.Plain);

// noneDelete: log no warnings and drop untokenizable characters
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
documentPreprocessor.setTokenizerFactory(factory);

From here: https://github.com/stanfordnlp/CoreNLP/issues/103#issuecomment-157793500

Upvotes: 0
