Akash

Reputation: 95

Finding token probabilities in a text in NLP

I came across the class TokenizerME on the OpenNLP documentation page (http://opennlp.apache.org/documentation/manual/opennlp.html). I don't understand how it calculates the probabilities. I tested it with different inputs and still don't get it. Can someone help me understand the algorithm behind it? I wrote this sample code:

public void tokenizerDemo() {
    // Load the pre-trained English tokenizer model
    try (InputStream modelIn = new FileInputStream("en-token.bin")) {
        TokenizerModel model = new TokenizerModel(modelIn);
        TokenizerME tokenizer = new TokenizerME(model);

        // Tokenize the sentence and print each token
        String[] tokens = tokenizer.tokenize("This is is book");
        for (String t : tokens) {
            System.out.println("Token : " + t);
        }

        // Probabilities for the tokens produced by the last tokenize() call
        double[] tokenProbs = tokenizer.getTokenProbabilities();
        for (double tP : tokenProbs) {
            System.out.println("Token Prob : " + tP);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I got this output:

Token : This
Token : is
Token : is
Token : book
Token Prob : 1.0
Token Prob : 1.0
Token Prob : 1.0
Token Prob : 1.0

I expected the token "is" to be counted twice and its probability to come out slightly higher than the other tokens'. Confused.

Upvotes: 1

Views: 222

Answers (1)

aab

Reputation: 11474

The tokenizer probabilities relate to the tokenizer's confidence in identifying the token spans themselves: whether this string of characters in this context is a token or not according to the tokenizer model. "This" at the beginning of a string with a following space is a very probable token for English, while "Thi" with a following "s" would not be.

The probabilities do not relate to how often a particular token content has been seen, just whether this sequence of characters is a probable token. The string "is is is is is is is" is easy to tokenize for English because "is" is a common word and spaces are good token boundaries. That's it.
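To make this concrete, here is a rough sketch (assuming the same en-token.bin model from the question) that prints each detected token span next to its probability, so you can see that the confidence is attached to the span boundaries, not to how often the word occurs:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class SpanProbDemo {
    public static void main(String[] args) throws IOException {
        String sentence = "This is is book";
        try (InputStream modelIn = new FileInputStream("en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(modelIn));

            // tokenizePos() returns the character spans of the detected tokens
            Span[] spans = tokenizer.tokenizePos(sentence);
            // getTokenProbabilities() returns one confidence value per span
            // from the most recent tokenization call
            double[] probs = tokenizer.getTokenProbabilities();

            for (int i = 0; i < spans.length; i++) {
                System.out.println(spans[i].getCoveredText(sentence)
                        + " " + spans[i] + " -> " + probs[i]);
            }
        }
    }
}

Each probability answers "how sure is the model that this span is a token boundary here", and for plain space-separated English words that confidence is essentially 1.0 regardless of how many times the word repeats.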

If you are interested in calculating n-gram probabilities, you should look at language models instead. (You'll still need to tokenize your text first, obviously.)
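If what you actually want is how often each token occurs, a simple frequency count over the tokenizer output is enough. This is just a unigram relative-frequency sketch, not a real language model:

import java.util.HashMap;
import java.util.Map;

public class UnigramDemo {
    public static void main(String[] args) {
        // Output of tokenizer.tokenize("This is is book") from the question
        String[] tokens = {"This", "is", "is", "book"};

        // Count how often each token occurs
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }

        // Relative frequency = count / total number of tokens
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double relFreq = (double) e.getValue() / tokens.length;
            System.out.println(e.getKey() + " : " + relFreq);
        }
        // Prints e.g. "is : 0.5" -- this is where "is" comes out higher,
        // unlike the tokenizer's span confidence, which stays at 1.0
    }
}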

Upvotes: 1
