Uzumaki Naruto

Reputation: 547

CoreNLP MaxentTagger Data Format Error

I'm using Stanford CoreNLP and am training the POS tagger on more domain-specific data. However, the trainer throws a "Data format error" when I run it with my properties file. Here's the context:

Training file

Please#UH let#VBP us#PRP know#VB if#IN you#PRP have#VBP any#DT other#JJ thoughts#NNS that#WDT...

(Basically a very long single line of word + tag pairs.)

Training properties file

         model = special_postagger.tagger
                  arch = words(-1,1),unicodeshapes(-1,1),order(2),suffix(4)
          wordFunction = 
             trainFile = /path/to/POS_trainer1.csv
       closedClassTags = 
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
                 debug = false
           debugPrefix = 
          tagSeparator = #
              encoding = UTF-8
            iterations = 100
                  lang = 
  learnClosedClassTags = false
      minFeatureThresh = 5
         openClassTags = 
rareWordMinFeatureThresh = 10
        rareWordThresh = 5
                search = qn
                  sgml = false
          sigmaSquared = 0.5
                 regL1 = 1.0
             tagInside = 
              tokenize = true
      tokenizerFactory = 
      tokenizerOptions = 
               verbose = false
        verboseResults = true
  veryCommonWordThresh = 250
              xmlInput = 
            outputFile = 
          outputFormat = slashTags
   outputFormatOptions = 
              nthreads = 1

Command Run

java edu.stanford.nlp.tagger.maxent.MaxentTagger -prop myProps.props

But for some reason, I get this error message:

warning: no language set, no open-class tags specified, and no closed-class tags specified; assuming ALL tags are open class tags
TaggerExperiments: adding word/tags
Exception in thread "main" java.lang.IllegalArgumentException: Data format error: can't find delimiter "#" in word "as" (line 2 of /path/to/POS_Trainer1.csv)
at edu.stanford.nlp.tagger.io.TextTaggedFileReader.primeNext(TextTaggedFileReader.java:74)
at edu.stanford.nlp.tagger.io.TextTaggedFileReader.<init>(TextTaggedFileReader.java:34)
at edu.stanford.nlp.tagger.io.TaggedFileRecord.reader(TaggedFileRecord.java:111)
at edu.stanford.nlp.tagger.maxent.ReadDataTagged.<init>(ReadDataTagged.java:52)
at edu.stanford.nlp.tagger.maxent.TaggerExperiments.<init>(TaggerExperiments.java:86)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.trainAndSaveModel(MaxentTagger.java:1140)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.runTraining(MaxentTagger.java:1207)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.main(MaxentTagger.java:1839)

Upvotes: 0

Views: 274

Answers (1)

Uzumaki Naruto

Reputation: 547

Answering my own question here: every token in the training file must follow the exact format [word][delimiter][tag], or the trainer throws a fatal IllegalArgumentException at runtime. You can use whatever delimiter you want, such as the # symbol, but it will fail if there is:

  • whitespace
  • a missing tag

anywhere inside the [word][delimiter][tag] pattern.
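
To find the offending tokens before running the trainer, a small pre-check can help. The sketch below is not part of CoreNLP. It assumes the reader splits each line on whitespace and separates word from tag at the last occurrence of the delimiter, which matches the error message above but is an assumption about the internals. The function name find_bad_tokens and the sample lines are hypothetical.

```python
def find_bad_tokens(lines, delimiter="#"):
    """Return (line_number, token) pairs that are not word<delimiter>tag.

    Assumption: tokens are whitespace-separated and the tag follows the
    LAST occurrence of the delimiter in each token.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        for token in line.split():
            word, sep, tag = token.rpartition(delimiter)
            # sep is empty when the delimiter is absent; word or tag is
            # empty when whitespace split the token or the tag is missing.
            if not sep or not word or not tag:
                problems.append((line_number, token))
    return problems


# Hypothetical sample data: line 2 has an untagged word, line 3 has a
# stray space before the delimiter and a token with no tag.
sample = [
    "Please#UH let#VBP us#PRP know#VB",
    "as soon#RB as#IN possible#JJ",
    "thoughts #NNS that#",
]

for line_number, token in find_bad_tokens(sample):
    print(f"line {line_number}: bad token {token!r}")
```

To check a real file, pass an open file handle instead of the sample list, for example find_bad_tokens(open(path, encoding="UTF-8")), using the same path and encoding as in the properties file.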

Upvotes: 1
