Reputation: 4602
I am creating a custom NER model using CoreNLP 3.6.0.
My props are:
# location of the training file
trainFile = /home/damiano/stanford-ner.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
I build with this command:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop /home/damiano/stanford-ner.prop
The problem appears when I use this model to extract the entities from a text file. The command is:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt
Where file.txt is:
Hello!
my
name
is
John.
The output is:
Hello/O !/O my/O name/O is/O John/PERSON ./O
As you can see, it splits "Hello!" into two tokens. The same happens with "John."
I must use a whitespace tokenizer.
How can I set it?
Why is CoreNLP splitting those words into two tokens?
Upvotes: 2
Views: 790
Reputation: 4749
Upd. If you want to use a whitespace tokenizer here, look at Christopher Manning's answer rather than adding tokenize.whitespace=true to your properties file.
However, regarding your second question ('why is CoreNLP splitting those words into two tokens?'), I'd suggest keeping the default tokenizer (PTBTokenizer), because it simply gives better results. The usual reason to switch to whitespace tokenization is a high demand for processing speed or (usually: and) a low demand for tokenization quality. Since you are going to use the tokens for NER, I doubt that this is your case.
Even in your example, if you end up with the token John. (with the trailing period attached) after tokenization, it cannot be matched by a gazetteer or by the training examples.
More details and reasons why tokenization isn't that simple can be found here.
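To make the point about John. concrete, here is a minimal sketch in plain Java. It does not use CoreNLP at all: the regex splitter is only a crude stand-in for a punctuation-aware tokenizer (it is not PTBTokenizer), and the one-entry gazetteer and the class and method names are hypothetical, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizationSketch {
    // Hypothetical one-entry gazetteer, for illustration only.
    static final Set<String> GAZETTE = new HashSet<>(Arrays.asList("John"));

    // Crude stand-in for a punctuation-aware tokenizer (NOT PTBTokenizer):
    // a token is either a run of word characters or a single punctuation mark.
    static final Pattern WORD_OR_PUNCT = Pattern.compile("\\w+|[^\\w\\s]");

    // Splits on runs of whitespace only, so punctuation stays glued to words.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Separates punctuation marks from the words they are attached to.
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD_OR_PUNCT.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Returns the tokens that exactly match a gazetteer entry.
    static List<String> gazetteHits(List<String> tokens) {
        List<String> hits = new ArrayList<>();
        for (String token : tokens) {
            if (GAZETTE.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Hello!\nmy\nname\nis\nJohn.";
        List<String> ws = whitespaceTokens(text);
        List<String> pa = punctuationAwareTokens(text);
        System.out.println("whitespace tokens:        " + ws);
        System.out.println("gazetteer hits:           " + gazetteHits(ws));
        System.out.println("punctuation-aware tokens: " + pa);
        System.out.println("gazetteer hits:           " + gazetteHits(pa));
    }
}
```

With whitespace splitting the last token keeps its period, so an exact lookup for John finds nothing. Once the period is split off, the lookup succeeds.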
Upvotes: 1
Reputation: 9450
You set your own tokenizer by specifying its class name with the tokenizerFactory flag/property:
tokenizerFactory = edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory
You can specify any class that implements the Tokenizer<T> interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options, you can specify them with tokenizerOptions. For instance, here, if you also specify:
tokenizerOptions = tokenizeNLs=true
then the newlines in your input will be preserved in the output (for output options that don't always convert things into a one-token-per-line format).
Note: Options like tokenize.whitespace=true apply at the level of CoreNLP. They aren't interpreted (you get a warning saying that the option is ignored) if provided to individual components like CRFClassifier.
As Nikita Astrakhantsev notes, this isn't necessarily a good thing to do. Doing it at test time is only correct if your training data is also whitespace separated; otherwise it will adversely affect performance. And tokens like the ones you get from whitespace separation are bad for subsequent NLP processing such as parsing.
Upvotes: 4