Reputation: 105497
I'm using the Stanford Log-linear Part-Of-Speech Tagger, and here is the sample sentence that I tag:
He can't do that
When tagged I get this result:
He_PRP ca_MD n't_RB do_VB that_DT
As you can see, can't is split into two tokens: ca is marked as a modal (MD) and n't is marked as an adverb (RB).
I get the same tags if I write can not as two separate words: can is MD and not is RB. So is this way of splitting expected, rather than, say, splitting into can_MD and 't_RB?
Upvotes: 0
Views: 678
Reputation: 1873
Note: this is not a perfect answer.
I think the problem originates from the tokenizer used by the Stanford POS Tagger, not from the tagger itself. The tokenizer (PTBTokenizer) does not split on the apostrophe the way you might expect:
1- Stanford PTBTokenizer token's split delimiter.
2- Stanford coreNLP - split words ignoring apostrophe.
As mentioned on the Stanford Tokenizer page, PTBTokenizer tokenizes the sentence:
"Oh, no," she's saying, "our $400 blender can't handle something this hard!"
to:
......
our
$
400
blender
ca
n't
handle
something
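The split shown above follows the Penn Treebank convention of treating the negative clitic n't as its own token, which is why can't comes out as ca and n't. Here is a minimal sketch of that convention in plain Java. It is only an illustration: the class PtbStyleSplit and its methods are hypothetical names, and this is not how PTBTokenizer is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the PTBTokenizer implementation.
class PtbStyleSplit {

    // Splits one whitespace-delimited word in the Penn Treebank style:
    // a trailing "n't" becomes its own token ("can't" -> "ca", "n't").
    static List<String> splitWord(String word) {
        List<String> tokens = new ArrayList<>();
        if (word.toLowerCase().endsWith("n't") && word.length() > 3) {
            int cut = word.length() - 3;
            tokens.add(word.substring(0, cut)); // stem, e.g. "ca" or "do"
            tokens.add(word.substring(cut));    // clitic "n't"
        } else {
            tokens.add(word);
        }
        return tokens;
    }

    // Splits the sentence on whitespace, then applies splitWord to each word.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            tokens.addAll(splitWord(word));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("He can't do that"));
    }
}
```

On the sentence from the question this yields the same five tokens that appear in the tagged result: He, ca, n't, do, that.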
Try to find a suitable tokenization method and pass its tokens to the tagger as follows:
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence; // newer CoreNLP releases rename this class to SentenceUtils
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class Test {

    public static void main(String[] args) throws Exception {
        // Path to the tagger model; adjust to your installation.
        String model = "F:/code/stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger";
        MaxentTagger tagger = new MaxentTagger(model);

        // Supply pre-tokenized words so that PTBTokenizer is bypassed.
        List<HasWord> sent = Sentence.toWordList("He", "can", "'t", "do", "that", ".");
        // Alternative: keep the contraction as a single token.
        //sent = Sentence.toWordList("He", "can't", "do", "that", ".");

        // Tag the tokens exactly as given and print word=tag pairs.
        List<TaggedWord> taggedSent = tagger.tagSentence(sent);
        for (TaggedWord tw : taggedSent) {
            System.out.print(tw.word() + "=" + tw.tag() + " , ");
        }
    }
}
output:
He=PRP , can=MD , 't=VB , do=VB , that=DT , .=. ,
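The token list above was typed by hand. As a rough sketch of what a "suitable tokenization method" could look like, the plain-Java helper below splits each word just before its first apostrophe. The class name ApostropheSplit and its tokenize method are hypothetical, and the rule is deliberately naive: it assumes whitespace-separated input and does not separate punctuation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Stanford tools.
class ApostropheSplit {

    // Splits on whitespace, then splits each word just before its first
    // apostrophe, so "can't" becomes "can" and "'t".
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            int idx = word.indexOf('\'');
            if (idx > 0) {
                tokens.add(word.substring(0, idx)); // part before the apostrophe
                tokens.add(word.substring(idx));    // apostrophe and the rest
            } else {
                tokens.add(word);                   // no apostrophe, or it is leading
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("He can't do that ."));
    }
}
```

The resulting strings could then be handed to Sentence.toWordList(tokens.toArray(new String[0])) in the program above. Keep in mind that the VB tag assigned to 't in the output suggests the model may not tag such non-standard tokens reliably.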
Upvotes: 1