Reputation: 547
I'm using Stanford CoreNLP and am training the POS tagger on more domain-specific data. However, the trainer throws a "Data format error" when I run it with my properties file. Here's the context:
Training file
Please#UH let#VBP us#PRP know#VB if#IN you#PRP have#VBP any#DT other#JJ thoughts#NNS that#WDT...
(Basically one very long line of word#tag pairs.)
Training properties file
model = special_postagger.tagger
arch = words(-1,1),unicodeshapes(-1,1),order(2),suffix(4)
wordFunction =
trainFile = /path/to/POS_trainer1.csv
closedClassTags =
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
debug = false
debugPrefix =
tagSeparator = #
encoding = UTF-8
iterations = 100
lang =
learnClosedClassTags = false
minFeatureThresh = 5
openClassTags =
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = qn
sgml = false
sigmaSquared = 0.5
regL1 = 1.0
tagInside =
tokenize = true
tokenizerFactory =
tokenizerOptions =
verbose = false
verboseResults = true
veryCommonWordThresh = 250
xmlInput =
outputFile =
outputFormat = slashTags
outputFormatOptions =
nthreads = 1
Command Run
java edu.stanford.nlp.tagger.maxent.MaxentTagger -prop myProps.props
But for some reason, I get this error message:
warning: no language set, no open-class tags specified, and no closed-class tags specified; assuming ALL tags are open class tags
TaggerExperiments: adding word/tags
Exception in thread "main" java.lang.IllegalArgumentException: Data format error: can't find delimiter "#" in word "as" (line 2 of /path/to/POS_Trainer1.csv)
at edu.stanford.nlp.tagger.io.TextTaggedFileReader.primeNext(TextTaggedFileReader.java:74)
at edu.stanford.nlp.tagger.io.TextTaggedFileReader.<init>(TextTaggedFileReader.java:34)
at edu.stanford.nlp.tagger.io.TaggedFileRecord.reader(TaggedFileRecord.java:111)
at edu.stanford.nlp.tagger.maxent.ReadDataTagged.<init>(ReadDataTagged.java:52)
at edu.stanford.nlp.tagger.maxent.TaggerExperiments.<init>(TaggerExperiments.java:86)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.trainAndSaveModel(MaxentTagger.java:1140)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.runTraining(MaxentTagger.java:1207)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.main(MaxentTagger.java:1839)
Upvotes: 0
Views: 274
Reputation: 547
Answering my own question here: every entry in the training file must follow the [word][delimiter][tag] format exactly, or the trainer throws a fatal runtime error. You can use whatever delimiter you want, such as the # symbol, but if anything else sits between the [word][delimiter][tag] entries (in my case, the untagged word "as" on line 2), training will fail.
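A quick pre-flight check can locate the offending tokens before training. The following is only a minimal sketch and not part of CoreNLP: the class name TrainingFileCheck and the method findBadTokens are made up for illustration. It assumes the trainer splits each line on whitespace and expects the delimiter inside every resulting token, which is what the error message in the question suggests. It is also slightly stricter than the error shown, since it flags tokens with an empty word or an empty tag as well.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class TrainingFileCheck {

    // Returns one message per token that does not look like word<delimiter>tag.
    static List<String> findBadTokens(List<String> lines, String delimiter) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no tokens
            }
            // Assumption: tokens are separated by whitespace.
            for (String token : line.split("\\s+")) {
                int pos = token.lastIndexOf(delimiter);
                // pos < 0: delimiter missing; pos == 0: empty word;
                // delimiter at the very end: empty tag.
                if (pos <= 0 || pos + delimiter.length() == token.length()) {
                    problems.add("line " + (i + 1) + ": \"" + token + "\"");
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java TrainingFileCheck <trainFile> <delimiter>");
            return;
        }
        // Read with the same encoding as the properties file (UTF-8).
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> problems = findBadTokens(lines, args[1]);
        problems.forEach(System.out::println);
        System.out.println(problems.size() + " malformed token(s)");
    }
}
```

After compiling with javac, running it as java TrainingFileCheck /path/to/POS_trainer1.csv "#" lists each suspicious token with its line number, so the file can be fixed before the trainer is started.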
Upvotes: 1