Reputation: 95
I came across the class TokenizerME on the OpenNLP documentation page (http://opennlp.apache.org/documentation/manual/opennlp.html). I don't understand how it calculates the probabilities. I tested it with different inputs and still don't get it. Can someone help me understand the algorithm behind it? I wrote this sample code:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public void tokenizerDemo() {
    // try-with-resources closes the model stream automatically
    try (InputStream modelIn = new FileInputStream("en-token.bin")) {
        TokenizerModel model = new TokenizerModel(modelIn);
        Tokenizer tokenizer = new TokenizerME(model);
        String[] tokens = tokenizer.tokenize("This is is book");
        for (String t : tokens) {
            System.out.println("Token : " + t);
        }
        double[] tokenProbs = ((TokenizerME) tokenizer).getTokenProbabilities();
        for (double tP : tokenProbs) {
            System.out.println("Token Prob : " + tP);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I got this output:
Token : This
Token : is
Token : is
Token : book
Token Prob : 1.0
Token Prob : 1.0
Token Prob : 1.0
Token Prob : 1.0
I expected the token "is" to be counted twice, so its probability should have been slightly higher than that of the other tokens. Confused.
Upvotes: 1
Views: 222
Reputation: 11474
The tokenizer probabilities reflect the tokenizer's confidence in identifying the token spans themselves: whether this string of characters, in this context, is a token according to the tokenizer model. "This" at the beginning of a string, followed by a space, is a very probable token for English, while "Thi" followed by an "s" would not be.
The probabilities do not relate to how often a particular token content has been seen, just whether this sequence of characters is a probable token. The string "is is is is is is is" is easy to tokenize for English because "is" is a common word and spaces are good token boundaries. That's it.
If you are interested in calculating n-gram probabilities, you should look at language models instead. (You'll still need to tokenize your text first, obviously.)
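If what you actually want is how often each token occurs (which is what the expected "slightly higher probability for 'is'" describes), that is a simple relative-frequency count over the token array, not something the tokenizer computes. A minimal sketch in plain Java, using a hypothetical helper class (not part of OpenNLP) applied to the tokens from the question:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: turns a token array
// into unigram relative frequencies (count / total tokens).
public class UnigramCounter {

    public static Map<String, Double> unigramProbabilities(String[] tokens) {
        // Count occurrences of each distinct token
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        // Divide each count by the total number of tokens
        Map<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (double) tokens.length);
        }
        return probs;
    }

    public static void main(String[] args) {
        // Same tokens the question's tokenizer produced
        String[] tokens = {"This", "is", "is", "book"};
        // "is" occurs 2 times out of 4 tokens, so its frequency is 0.5
        System.out.println(unigramProbabilities(tokens));
    }
}
```

Here "is" comes out at 0.5 and the other tokens at 0.25, which matches the intuition in the question; a proper language model would estimate these probabilities over a large corpus rather than a single sentence.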
Upvotes: 1