Max Koretskyi

Reputation: 105497

Why does the POS tagger split `can't` into separate words?

I'm using the Stanford Log-linear Part-Of-Speech Tagger, and here is the sample sentence that I tag:

He can't do that

When tagged I get this result:

He_PRP ca_MD n't_RB do_VB that_DT

As you can see, can't is split into two words: ca is tagged as a modal (MD) and n't is tagged as an adverb (RB).

I get the same tags if I write can not as two separate words: can is MD and not is RB. So is this way of splitting expected, rather than, say, splitting into can_MD and 't_RB?

Upvotes: 0

Views: 678

Answers (1)

houssam

Reputation: 1873

Note: This is not a perfect answer.
I think the problem originates from the tokenizer used by the Stanford POS Tagger, not from the tagger itself. The tokenizer (PTBTokenizer) splits contractions such as can't at the apostrophe, as these related questions also discuss:
1- Stanford PTBTokenizer token's split delimiter.
2- Stanford coreNLP - split words ignoring apostrophe.
As mentioned on the Stanford Tokenizer page, PTBTokenizer tokenizes the sentence:

"Oh, no," she's saying, "our $400 blender can't handle something this hard!"

to:

......
our
$
400
blender
ca
n't
handle
something
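
The split shown above matches the Penn Treebank convention of detaching the negative clitic n't from its host word, which is why the result is ca + n't rather than can + 't. The following is only a minimal sketch of that single rule, written for illustration; the class and method names are hypothetical, and this is not how PTBTokenizer is actually implemented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ContractionSplitSketch {

    // Hypothetical helper: split on whitespace, then detach a trailing "n't"
    // from each word, in the style of Penn Treebank tokenization.
    static List<String> splitContractions(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty input yields one empty string; skip it
            }
            String lower = word.toLowerCase(Locale.ROOT);
            if (lower.endsWith("n't") && word.length() > 3) {
                int cut = word.length() - 3;
                tokens.add(word.substring(0, cut)); // host word, e.g. "ca"
                tokens.add(word.substring(cut));    // clitic, e.g. "n't"
            } else {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [He, ca, n't, do, that]
        System.out.println(splitContractions("He can't do that"));
    }
}
```

Because tagger models are typically trained on text tokenized this way, n't is a token the model has seen, whereas 't on its own is not.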

Try to find a suitable tokenization method and pass its tokens to the tagger, as follows:

    import java.util.List;
    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.ling.Sentence;
    import edu.stanford.nlp.ling.TaggedWord;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class Test {

        public static void main(String[] args) throws Exception {
            // Path to the tagger model; adjust it to your local installation.
            String model = "F:/code/stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger";
            MaxentTagger tagger = new MaxentTagger(model);

            // Supply pre-tokenized words so that the default tokenizer is bypassed.
            List<HasWord> sent = Sentence.toWordList("He", "can", "'t", "do", "that", ".");
            // Alternative: keep the contraction as a single token.
            //sent = Sentence.toWordList("He", "can't", "do", "that", ".");

            // Tag the tokens and print each one as word=tag.
            List<TaggedWord> taggedSent = tagger.tagSentence(sent);
            for (TaggedWord tw : taggedSent) {
                System.out.print(tw.word() + "=" + tw.tag() + " , ");
            }
        }
    }

output:

He=PRP , can=MD , 't=VB , do=VB , that=DT , .=. ,
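
The tokens in the code above are typed in by hand. If you want to produce them from a raw string while keeping contractions whole, one option is a small regex-based tokenizer. This is a rough sketch under my own assumptions: the class name, method name, and pattern are hypothetical and are not part of the Stanford library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApostropheKeepingTokenizer {

    // A token is either a run of letters/digits that may contain internal
    // apostrophes (so "can't" stays whole), or any single non-space symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'\\p{L}+)*|\\S");

    // Hypothetical helper: returns the tokens of the text in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [He, can't, do, that, .]
        System.out.println(tokenize("He can't do that."));
    }
}
```

The resulting strings could then be handed to `Sentence.toWordList` as in the code above. Keep in mind that the tagger model was probably trained on text where can't appears as ca and n't, so a single can't token may not get a reliable tag.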

Upvotes: 1
