Dail

Reputation: 4602

How to set a whitespace tokenizer on an NER model?

I am creating a custom NER model using CoreNLP 3.6.0.

My props are:

# location of the training file 
trainFile = /home/damiano/stanford-ner.tsv 
# location where you would like to save (serialize) your 
# classifier; adding .gz at the end automatically gzips the file, 
# making it smaller, and faster to load 
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that 
# the word is in column 0 and the correct answer is in column 1 
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features 
# apply at most to a class pair of previous class and current class 
# or current class and next class. 
maxLeft=1

# these are the features we'd like to train with 
# some are discussed below, the rest can be 
# understood by looking at NERFeatureFactory 
useClassFeature=true 
useWord=true 
# word character ngrams will be included up to length 6 as prefixes 
# and suffixes only  
useNGrams=true 
noMidNGrams=true 
maxNGramLeng=6 
usePrev=true 
useNext=true 
useDisjunctive=true 
useSequences=true 
usePrevSequences=true 
# the last 4 properties deal with word shape features 
useTypeSeqs=true 
useTypeSeqs2=true 
useTypeySequences=true 
wordShape=chris2useLC
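
For reference, the training file named by trainFile is tab-separated, one token per line, with the token in column 0 and the gold label in column 1 (matching the map property above); O marks non-entity tokens and a blank line separates sentences. An illustrative snippet (the tokens and labels here are made up):

Hello	O
!	O
my	O
name	O
is	O
John	PERSON
.	O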

I build with this command:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -prop /home/damiano/stanford-ner.prop

The problem arises when I use this model to retrieve the entities inside a text file. The command is:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt

Where file.txt is:

Hello!
my
name
is
John.

The output is:

Hello/O !/O my/O name/O is/O John/PERSON ./O

As you can see, it splits "Hello!" into two tokens. Same thing for "John."

I must use a whitespace tokenizer.

How can I set it?

Why is CoreNLP splitting those words into two tokens?

Upvotes: 2

Views: 790

Answers (2)

Nikita Astrakhantsev

Reputation: 4749

Upd.: if you want to use a whitespace tokenizer here, simply add tokenize.whitespace=true to your properties file; see Christopher Manning's answer.

However, to answer your second question ('why is CoreNLP splitting those words into two tokens?'), I'd suggest keeping the default tokenizer (which is PTBTokenizer), because it simply gives better results. The usual reason to switch to whitespace tokenization is a high demand for processing speed or (more usually, and) low requirements on tokenization quality. Since you are going to use it for further NER, I doubt that this is your case.

Even in your example: if you have the token John. after tokenization, it cannot be matched by a gazetteer or by training examples. More details and reasons why tokenization isn't that simple can be found here.
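
To see the difference concretely, here is a minimal sketch that runs both tokenizers on the sentence from the question. It assumes CoreNLP 3.6.0 on the classpath and uses the factory methods from edu.stanford.nlp.process:

import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.WhitespaceTokenizer;
import java.io.StringReader;

public class TokenizerComparison {
    public static void main(String[] args) {
        String text = "Hello! my name is John.";

        // the default PTBTokenizer: punctuation becomes separate tokens
        Tokenizer<Word> ptb = PTBTokenizer.newPTBTokenizer(new StringReader(text));
        System.out.println(ptb.tokenize());
        // expected: [Hello, !, my, name, is, John, .]

        // WhitespaceTokenizer: splits on whitespace only
        Tokenizer<Word> ws = WhitespaceTokenizer.newWordWhitespaceTokenizer(new StringReader(text));
        System.out.println(ws.tokenize());
        // expected: [Hello!, my, name, is, John.]
    }
}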

Upvotes: 1

Christopher Manning

Reputation: 9450

You can set your own tokenizer by specifying the class name in the tokenizerFactory flag/property:

tokenizerFactory = edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory

You can specify any class that implements the Tokenizer<T> interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options, you can specify them with tokenizerOptions. For instance, here, if you also specify:

tokenizerOptions = tokenizeNLs=true

then the newlines in your input will be preserved in the output (for output options that don't always convert things into a one-token-per-line format).
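
For example, the test command from the question would then become something like the sketch below, assuming the flag is accepted on the command line as well as in the .prop file (the class name is quoted so the shell doesn't expand the $):

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt -tokenizerFactory 'edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory'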

Note: options like tokenize.whitespace=true apply at the level of the full CoreNLP pipeline. They aren't interpreted (you get a warning saying the option is ignored) if provided to an individual component like CRFClassifier.
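
For instance, if you were running the full pipeline rather than CRFClassifier directly, the option would go in the pipeline's properties. A minimal sketch, again assuming CoreNLP 3.6.0:

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class WhitespacePipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        // honored here, at the pipeline level, unlike when passed to CRFClassifier
        props.setProperty("tokenize.whitespace", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("Hello! my name is John.");
        pipeline.annotate(doc);
    }
}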

As Nikita Astrakhantsev notes, this isn't necessarily a good thing to do. Doing it at test time is only correct if your training data is also whitespace-separated; otherwise it will adversely affect performance. And tokens like the ones you get from whitespace separation are bad for subsequent NLP processing such as parsing.

Upvotes: 4
