minhle_r7

Reputation: 872

Why does the Stanford POS tagger modify the input sentence?

I took this sentence from the Wall Street Journal and passed it through the Stanford POS tagger. Strangely, the tagger changed "theatre" into "theater".

The command:

java -classpath stanford-postagger-2015-12-09/stanford-postagger-3.6.0.jar:stanford-postagger-2015-12-09/lib/slf4j-simple.jar:stanford-postagger-2015-12-09/lib/slf4j-api.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props stanford-postagger-2015-12-09/penn-treebank.props -model /home/minhle/redep/output/dep/penntree.jackknife/jackknife-04.model -testFile format=TREES,test.tree

The property file:

## adopted english-bidirectional-distsim.tagger.props
## tagger training invoked at Tue Feb 25 01:33:39 PST 2014 with arguments:
                    arch = bidirectional5words,naacl2003unknowns,allwordshapes(-1,1),distsim(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1),distsimconjunction(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1)
            wordFunction = edu.stanford.nlp.process.AmericanizeFunction
         closedClassTags =
 closedClassTagThreshold = 40
 curWordMinFeatureThresh = 2
                   debug = false
             debugPrefix =
            tagSeparator = _
                encoding = UTF-8
              iterations = 100
                    lang = english
    learnClosedClassTags = false
        minFeatureThresh = 2
           openClassTags =
rareWordMinFeatureThresh = 5
          rareWordThresh = 5
                  search = owlqn2
                    sgml = false
            sigmaSquared = 0.5
                   regL1 = 0.75
               tagInside =
                tokenize = true
        tokenizerFactory =
        tokenizerOptions =
                 verbose = false
          verboseResults = true
    veryCommonWordThresh = 250
                xmlInput =
              outputFile =
            outputFormat = slashTags
     outputFormatOptions =
                nthreads = 4

The input sentence:

( (SINV (`` ``) (S-TPC-2 (PP (IN Without) (NP (DT some) (JJ unexpected) (`` ``) (FW coup) (FW de) (FW theatre) ('' ''))) (, ,) (NP-SBJ (PRP I)) (VP (VBP do) (RB n't) (VP (VB see) (SBAR (WHNP-1 (WP what)) (S (NP-SBJ-1 (-NONE- T)) (VP (MD will) (VP (VB block) (NP (DT the) (NNP Paribas) (NN bid))))))))) (, ,) ('' '') (VP (VBD said) (S-2 (-NONE- T))) (NP-SBJ (NP (NNP Philippe) (NNP de) (NNP Cholet)) (, ,) (NP (NP (NN analyst)) (PP-LOC (IN at) (NP (NP (DT the) (NN brokerage)) (NP (NNP Cholet) (HYPH -) (NNP Dupont) (CC &) (NNP Cie)))))) (. .)) )

The output:

``_`` Without_IN some_DT unexpected_JJ ``_`` coup_NN de_IN theater_NN ''_'' ,_, I_PRP do_VBP n't_RB see_VB what_WP will_MD block_VB the_DT Paribas_NNP bid_NN ,_, ''_'' said_VBD Philippe_NNP de_IN Cholet_NNP ,_, analyst_NN at_IN the_DT brokerage_NN Cholet_NNP -_HYPH Dupont_NNP &_CC Cie_NNP ._.

Upvotes: 0

Views: 185

Answers (1)

Jon Gauthier

Reputation: 25572

As I understand it, the Stanford POS tagger is trained on US English data. At runtime we "Americanize" the input so that the tagger recognizes it properly. See this line in your configuration file:

wordFunction = edu.stanford.nlp.process.AmericanizeFunction

If you are accessing CoreNLP programmatically, you can retrieve the pre-Americanized form via CoreLabel.originalText(). You could also just disable the AmericanizeFunction, but you might see some incorrect outputs as a result.
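If you are running the tagger from the command line, as in the question, and only have its text output, one workaround is to put the original tokens back after tagging. The following is a minimal sketch, not part of the Stanford API: the class and method names (RestoreOriginalTokens, restore) are hypothetical, and it assumes the tagger emits exactly one word_TAG item per input token, in order, with the tagSeparator from your props file (here `_`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Stanford class): re-attaches the original
// tokens to the tags produced by the tagger.
class RestoreOriginalTokens {

    // Assumption: taggedLine holds one word<sep>TAG item per original token,
    // in the same order as originalTokens.
    static List<String> restore(List<String> originalTokens,
                                String taggedLine,
                                char tagSeparator) {
        String[] tagged = taggedLine.trim().split("\\s+");
        if (tagged.length != originalTokens.size()) {
            throw new IllegalArgumentException(
                "Token count mismatch: " + tagged.length
                + " tagged vs " + originalTokens.size() + " original");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tagged.length; i++) {
            // Split on the LAST separator so words that contain it survive.
            int cut = tagged[i].lastIndexOf(tagSeparator);
            if (cut < 0) {
                throw new IllegalArgumentException(
                    "No tag separator in: " + tagged[i]);
            }
            String tag = tagged[i].substring(cut + 1);
            // Keep the original spelling, take only the tag from the output.
            result.add(originalTokens.get(i) + tagSeparator + tag);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("coup", "de", "theatre");
        String tagged = "coup_NN de_IN theater_NN";
        // prints: coup_NN de_IN theatre_NN
        System.out.println(String.join(" ", restore(original, tagged, '_')));
    }
}
```

The length check is there on purpose: if the tagger ever splits or merges a token, it is safer to fail loudly than to silently misalign words and tags.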

Upvotes: 2
