smwikipedia

Reputation: 64333

Why can we use entropy to measure the quality of a language model?

I am reading Foundations of Statistical Natural Language Processing. It has the following statement about the relationship between information entropy and language models:

...The essential point here is that if a model captures more of the structure of a language, then the entropy of the model should be lower. In other words, we can use entropy as a measure of the quality of our models...

But how about this example:

Suppose we have a machine that emits two characters, A and B, one at a time, and the designer of the machine made A and B equally probable.

I am not the designer, so I try to model the machine through experiment.

During an initial experiment, I saw the machine emit the following character sequence:

A, B, A

So I model the machine as P(A)=2/3 and P(B)=1/3, and we can calculate the entropy of this model as:

-2/3*log(2/3) - 1/3*log(1/3) ≈ 0.918 bits  (log base 2)

But then the designer told me about his design, so I refined my model with this additional information. The new model looks like this:

P(A)=1/2 P(B)=1/2

And the entropy of this new model is:

-1/2*log(1/2) - 1/2*log(1/2) = 1 bit

The second model is obviously better than the first one. But the entropy increased.
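
For reference, here is a quick way to check both entropy values. This is only a minimal sketch; the entropy helper below is just for illustration:

from math import log2

# Entropy of a discrete distribution, in bits.
def entropy(dist):
    return -sum(p * log2(p) for p in dist.values())

print(entropy({'A': 2/3, 'B': 1/3}))  # ~0.918 bits
print(entropy({'A': 1/2, 'B': 1/2}))  # 1.0 bits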

My point is that, given the arbitrariness of the model being tried, we cannot blindly say that a smaller entropy indicates a better model.

Could anyone shed some light on this?

ADD 1

(Many thanks to Rob Neuhaus!)

After re-reading the NLP book mentioned above, I think I can explain it now.

What I calculated above is actually the entropy of the language model's own distribution, and it cannot be used to evaluate the effectiveness of the model.

To evaluate a language model, we should measure how much surprise it gives us on real sequences in that language. For each real symbol encountered, the model gives a probability p, and we use -log(p) to quantify the surprise. We then average the total surprise over a long enough sequence. So, for a 1000-letter sequence with 500 A's and 500 B's, the average surprise given by the 2/3-1/3 model is:

[-500*log(2/3) - 500*log(1/3)] / 1000 = 1/2 * log(9/2) ≈ 1.085 bits

While the correct 1/2-1/2 model will give:

[-500*log(1/2) - 500*log(1/2)] / 1000 = 1/2 * log(8/2) = 1 bit

So we can see that the 2/3-1/3 model gives more surprise, which indicates it is worse than the correct model.
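
The same calculation can be sketched in a few lines of Python; the avg_surprise helper below is hypothetical and just mirrors the arithmetic above:

from math import log2

# Average surprise, in bits per character, over 500 A's and 500 B's.
def avg_surprise(model, n_a=500, n_b=500):
    total = -n_a * log2(model['A']) - n_b * log2(model['B'])
    return total / (n_a + n_b)

print(avg_surprise({'A': 2/3, 'B': 1/3}))  # ~1.085 bits
print(avg_surprise({'A': 1/2, 'B': 1/2}))  # 1.0 bits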

Only when the sequence is long enough will the average mimic the expectation under the true 1/2-1/2 distribution; a short sequence won't give a convincing result.

I didn't mention cross-entropy here since I think the jargon is intimidating and not very helpful for revealing the root cause.

Upvotes: 4

Views: 1640

Answers (1)

Rob Neuhaus

Reputation: 9290

If you had a larger sample of data, it's very likely that the model that assigns 2/3 to A and 1/3 to B would do worse than the true model, which gives 1/2 to each. The problem is that your training set is too small, so you were misled into thinking the wrong model was better. I encourage you to experiment: generate a random string of length 10000 where each character is equally likely, then measure the cross entropy of the 2/3, 1/3 model vs the 1/2, 1/2 model on that much longer string. I am sure you will see that the latter performs better. Here is some sample Python code demonstrating the fact.

from math import log
import random

def cross_entropy(prediction_probability_seq):
    # Average number of bits of surprise per predicted symbol.
    probs = list(prediction_probability_seq)
    return -sum(log(p, 2) for p in probs) / len(probs)

def predictions(seq, model):
    # Yield the probability the model assigns to each observed symbol.
    for item in seq:
        yield model[item]

# A long sample in which 'a' and 'b' are equally likely.
rand_char_seq = [random.choice(['a', 'b']) for _ in range(10000)]

def print_ent(m):
    print('cross entropy of', m, cross_entropy(predictions(rand_char_seq, m)))

print_ent({'a': 0.5, 'b': 0.5})
print_ent({'a': 2/3, 'b': 1/3})

Notice that if you add an extra 'a' to the list passed to random.choice, then the second model (which is now closer to the true distribution) gets a lower cross entropy than the first.
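
For example, a minimal variant of the script above (reusing its cross_entropy and predictions helpers, which must already be defined) skews the generated sequence toward 'a' and compares the two models again:

# Skew the true distribution to roughly 2/3 'a', 1/3 'b' by repeating 'a'
# in the list passed to random.choice.
skewed_seq = [random.choice(['a', 'a', 'b']) for _ in range(10000)]

for model in ({'a': 0.5, 'b': 0.5}, {'a': 2/3, 'b': 1/3}):
    print('cross entropy of', model,
          cross_entropy(predictions(skewed_seq, model)))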

However, one other thing to consider is that you really want to measure the likelihood on held out data that you didn't observe during training. If you do not do this, more complicated models that memorize the noise in the training data will have an advantage over smaller/simpler models that don't have as much ability to memorize noise.
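
As a rough sketch of what that looks like with the toy data above (the train/held-out split here is purely illustrative): estimate the model from counts on one half of the sequence and measure cross entropy only on the other half.

from collections import Counter

# Illustrative split: fit a unigram model on the first half of the data,
# then evaluate it on the second half, which it never saw during training.
train, held_out = rand_char_seq[:5000], rand_char_seq[5000:]
counts = Counter(train)
fitted_model = {ch: counts[ch] / len(train) for ch in counts}

print('cross entropy on held-out data:',
      cross_entropy(predictions(held_out, fitted_model)))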

One real problem with likelihood as a measure of language model quality is that it sometimes doesn't perfectly predict the actual higher-level application's error rate. For example, language models are often used in speech recognition systems. There have been improved language models (in terms of entropy) that didn't drive down the overall system's word error rate, which is what the designers really care about. This can happen if the language model improves predictions in places where the recognition system is already confident enough to get the right answer.

Upvotes: 2
