How to set whitespace tokenizer on NER Model?

Question

i am creating a custom NER model using CoreNLP 3.6.0

My props are:

# location of the training file 
trainFile = /home/damiano/stanford-ner.tsv 
# location where you would like to save (serialize) your 
# classifier; adding .gz at the end automatically gzips the file, 
# making it smaller, and faster to load 
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that 
# the word is in column 0 and the correct answer is in column 1 
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features 
# apply at most to a class pair of previous class and current class 
# or current class and next class. 
maxLeft=1

# these are the features we'd like to train with 
# some are discussed below, the rest can be 
# understood by looking at NERFeatureFactory 
useClassFeature=true 
useWord=true 
# word character ngrams will be included up to length 6 as prefixes 
# and suffixes only  
useNGrams=true 
noMidNGrams=true 
maxNGramLeng=6 
usePrev=true 
useNext=true 
useDisjunctive=true 
useSequences=true 
usePrevSequences=true 
# the last 4 properties deal with word shape features 
useTypeSeqs=true 
useTypeSeqs2=true 
useTypeySequences=true 
wordShape=chris2useLC

I build with this command:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -prop /home/damiano/stanford-ner.prop

The problem is when i use this model to retrieve the entities inside a textfile. The command is:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt

Where file.txt is:

Hello!
my
name
is
John.

The output is:

Hello/O !/O my/O name/O is/O John/PERSON ./O

As you can see it split "Hello!" into two tokens. Same thing for "John."

I must use whitespace tokenizer.

How can i set it?

why does CoreNlp is splitting those words in two tokens?

Christopher Manning · Accepted Answer

You set your own tokenizer by specifying the classname to the tokenizerFactory flag/property:

tokenizerFactory = edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory

You can specify any class that implements Tokenizer interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options you can specify them with tokenizerOptions For instance, here, if you also specify:

tokenizerOptions = tokenizeNLs=true

then the newlines in your input will be preserved in the input (for output options that don't convert things always into a one-token-per-line format).

Note: Options like tokenize.whitespace=true apply at the level of CoreNLP. They aren't interpreted (you get a warning saying that the option is ignored) if provided to individual components like CRFClassifier.

As Nikita Astrakhantsev notes, this isn't necessarily a good thing to do. Doing it at test time would only be correct if your training data is also whitespace separated, but otherwise will adversely affect performance. And having tokens like the ones you get from whitespace separation are bad for doing subsequent NLP processing such as parsing.

How to set whitespace tokenizer on NER Model?

Answers (2)

Related Questions