Helena Galhardas

Reputation: 1

Stanford NER punctuation

We are using Stanford NER to train our own CRF classifier for French newspaper texts. We are having problems with punctuation: Stanford NER seems to replace some punctuation marks with others.

Here is an example where the ' in "Aujourd'hui" is replaced by ` and the « and » that enclose Ave Maria are replaced by `` and ''.

Input raw text:

" Aujourd'hui ... « Ave Maria » et ..."

Stanford NER output:

word    | tag | begin-offset | end-offset
Aujourd | O   | 31           | 38
`       | O   | 38           | 39
hui     | O   | 39           | 42

``      | O   | 331          | 332
Ave     | O   | 333          | 336
Maria   | O   | 337          | 342
''      | O   | 343          | 344

We have tested the following flags when creating the classifier:

-outputFormatOptions includePunctuationDependencies
-inputEncoding utf-8
-outputEncoding utf-8

but none has worked.

I would appreciate any help.

Upvotes: 0

Views: 243

Answers (1)

StanfordNLPHelp

Reputation: 8739

Here is an example command tokenizing French text with the French tokenizer:

java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -file example-french-sentence-one.txt -outputFormat text

Note the tokenize property:

tokenize.language = fr

This tells the pipeline to use the French tokenizer.

That should handle the case of Aujourd'hui. Unfortunately, the French lexer hard-codes the conversion of guillemets to ", and no option changes that behavior.

If I get a chance, I'll try to push a change to the French tokenizer that makes that behavior optional.

If you have another way to tokenize your text before submitting it to Stanford CoreNLP, you can pass already tokenized text to the pipeline: set the tokenize.whitespace option and separate the tokens with whitespace. Alternatively, you could preprocess your training data so that it matches the way Stanford CoreNLP tokenizes it.
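As a rough illustration of the pre-tokenization option, here is a minimal sketch that uses only the Java standard library and leaves apostrophes and guillemets unchanged. The class name FrenchPreTokenizer and its token rules are hypothetical, not part of Stanford CoreNLP. In particular, it assumes that words joined by an apostrophe or hyphen stay as one token, which may not match how you want to treat elisions such as l'homme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a CoreNLP class: splits raw text into tokens
// without rewriting any punctuation characters.
class FrenchPreTokenizer {
    // Assumed token rules, tried in this order:
    //   1. a three-dot ellipsis as a single token
    //   2. letters/digits, optionally joined by ' or U+2019 or -
    //   3. any other single non-whitespace character (guillemets, commas, ...)
    private static final Pattern TOKEN = Pattern.compile(
        "\\.{3}"
        + "|[\\p{L}\\p{N}]+(?:['\\u2019-][\\p{L}\\p{N}]+)*"
        + "|\\S");

    // Returns the tokens in their original order and original spelling.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Joins the tokens with single spaces, the form expected when the
    // pipeline is told to split on whitespace only.
    static String preTokenize(String text) {
        return String.join(" ", tokenize(text));
    }

    public static void main(String[] args) {
        // \u00AB and \u00BB are the left and right guillemets.
        System.out.println(preTokenize("Aujourd'hui... \u00ABAve Maria\u00BB et..."));
    }
}
```

The resulting text could then be given to the pipeline with tokenize.whitespace = true in the properties. Note that token offsets would then refer to the pre-tokenized text rather than to the original file, so you may need to map them back if you rely on the original offsets.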

Upvotes: 2
