hxdshell

Reputation: 27

Why do we add |V| in the denominator in the Add-One smoothing for n-gram language models?

In NLP, when we use the Laplace (Add-one) smoothing technique, we assume that every word is seen one more time than its actual count, and the formula looks like this:

P(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + |V|)

where |V| is the size of the vocabulary. My question is: why do we add |V| to the denominator when we are only considering the count of the previous word?

I have a rough idea that because every word's count is incremented by one, we have to normalize by |V|, but I still don't understand it properly. Since we are only considering the count of the previous word, why don't we just add 1 to it?

I also saw that if we add |V|, the probabilities of all bigrams with the same previous word sum to 1, which is what they should do. But is there any other explanation for the |V|?

Upvotes: 1

Views: 650

Answers (1)

alvas

Reputation: 122052

The |V| that we see in the denominator of the additive smoothing function is not a direct part of the probabilistic estimate of the n-gram by definition. It is derived from:

sum_{w in V} (c(wi-1, w) + 1)  =  sum_{w in V} c(wi-1, w) + |V|  =  c(wi-1) + |V|

First, we start with the naive idea that if we add 1 to the counts in the numerator, we must also add those 1s to the denominator so that the probabilities still sum to 1.

But instead of adding +1 to every term in the vocabulary inside the sum, we can simply add the size of the vocabulary once. That is why you see sum(c(wi-1)) + |V| in the denominator rather than sum(c(wi-1) + 1); note the scope of the "sum" function.
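As a minimal sketch of that scope change (the history word's counts here are invented purely to illustrate the identity):

from collections import Counter

vocab = ["foo", "good", "better", "this", "bar"]
counts = Counter({"good": 2, "foo": 1, "better": 1})  # hypothetical c(wi-1, w) values

inner = sum(counts[w] + 1 for w in vocab)           # +1 inside the sum
outer = sum(counts[w] for w in vocab) + len(vocab)  # +|V| outside the sum

assert inner == outer == 9  # c(wi-1) + |V| = 4 + 5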


More details

Sometimes I find it easier to see the math in code. Consider this bigram model without Laplace smoothing:

# From https://docs.python.org/3/library/itertools.html#itertools.pairwise
try:
    from itertools import pairwise  # Python 3.10+
except ImportError:
    from itertools import tee
    def pairwise(iterable):
        # pairwise('ABCDEFG') --> AB BC CD DE EF FG
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

    
def count_numerator(ngrams, sentences):
    # c(wi-1, wi): how many times the bigram appears across all sentences.
    return sum(list(pairwise(s)).count(ngrams) for s in sentences)

def count_denominator(word, sentences):
    # c(wi-1): how many bigrams start with `word`, across all sentences.
    return sum(sum(int(ng[0] == word) for ng in pairwise(s)) for s in sentences)


s1 = ["<s>", "this", "is", "foo", "bar", '</s>']
s2 = ["<s>", "foo", "bar", "is", "good", '</s>']
s3 = ["<s>", "fruit", "bar", "is", "better", '</s>']
s4 = ["<s>", "this", "is", "good", "</s>"]
s5 = ["<s>", "i", "like", "this", "thing", '</s>']

sentences = [s1, s2, s3, s4, s5]


prob_bos_this = count_numerator(('<s>', 'this'), sentences) / count_denominator("<s>", sentences) 

prob_this_is = count_numerator(('this', 'is'), sentences) / count_denominator("this", sentences)

prob_is_good = count_numerator(('is', 'good'), sentences) / count_denominator("is", sentences)

prob_good_eos = count_numerator(('good', '</s>'), sentences) / count_denominator("good", sentences)

print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos

print_this_is_good # Outputs: 0.1333

Now consider Laplace smoothing, with the incremental +1 inside the numerator and denominator sums:

def count_numerator_laplace(ngrams, sentences):
    # The +1 is applied once per sentence in the outer sum.
    return sum(list(pairwise(s)).count(ngrams) + 1 for s in sentences)

def count_denominator_laplace(word, sentences):
    # The +1 is applied once per bigram in the corpus.
    return sum(sum(int(ng[0] == word) + 1 for ng in pairwise(s)) for s in sentences)


prob_bos_this = count_numerator_laplace(('<s>', 'this'), sentences) / count_denominator_laplace("<s>", sentences) 

prob_this_is = count_numerator_laplace(('this', 'is'), sentences) / count_denominator_laplace("this", sentences)

prob_is_good = count_numerator_laplace(('is', 'good'), sentences) / count_denominator_laplace("is", sentences)

prob_good_eos = count_numerator_laplace(('good', '</s>'), sentences) / count_denominator_laplace("good", sentences)

print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
print_this_is_good  # 0.004212103350034384

Note that the +1 in the numerator adds 1 to the wi-1, wi count once per sentence. The +1 in the denominator is added once for every ngram that exists in the corpus, not just the ones of the form wi-1, *.
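Printing the raw counts (continuing with the functions and sentences defined above) makes this concrete:

# ('this', 'is') occurs twice and there are 5 sentences, so the smoothed
# numerator is 2 + 5 = 7; "this" starts 3 bigrams and the corpus contains
# 24 bigrams in total, so the denominator is 3 + 24 = 27.
print(count_numerator_laplace(('this', 'is'), sentences))    # 7
print(count_denominator_laplace('this', sentences))          # 27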


Since we are adding +1 for every ngram in the same fashion each time, we can instead add the number of ngrams that exist once, outside the inner sum, e.g.

def count_numerator_laplace(ngrams, sentences):
    # Unchanged: +1 once per sentence.
    return sum(list(pairwise(s)).count(ngrams) + 1 for s in sentences)

def count_denominator_laplace(word, sentences):
    # The per-bigram +1s are pulled out of the inner sum and added once,
    # as the number of bigrams in the sentence.
    return sum(sum(int(ng[0] == word) for ng in pairwise(s)) + len(list(pairwise(s))) for s in sentences)


prob_bos_this = count_numerator_laplace(('<s>', 'this'), sentences) / count_denominator_laplace("<s>", sentences) 

prob_this_is = count_numerator_laplace(('this', 'is'), sentences) / count_denominator_laplace("this", sentences)

prob_is_good = count_numerator_laplace(('is', 'good'), sentences) / count_denominator_laplace("is", sentences)

prob_good_eos = count_numerator_laplace(('good', '</s>'), sentences) / count_denominator_laplace("good", sentences)

print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
print_this_is_good  # 0.004212103350034384

This is a good proof of concept for how the +1 in the inner sum becomes a +|V| when you carry the +1 out of the summation, e.g.

my_list = [1,2,3]
sum(i+1 for i in my_list) == sum(i for i in my_list) + len(my_list)
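To tie this back to the original question, here is a minimal sketch of the textbook formula over the same toy corpus, with an explicit vocabulary (the helper name p_laplace is my own), showing that the +|V| is exactly what makes the conditional distribution sum to 1:

from collections import Counter
from itertools import chain

sentences = [
    ["<s>", "this", "is", "foo", "bar", "</s>"],
    ["<s>", "foo", "bar", "is", "good", "</s>"],
    ["<s>", "fruit", "bar", "is", "better", "</s>"],
    ["<s>", "this", "is", "good", "</s>"],
    ["<s>", "i", "like", "this", "thing", "</s>"],
]

vocab = set(chain.from_iterable(sentences))                       # |V| = 12
bigrams = Counter(ng for s in sentences for ng in zip(s, s[1:]))  # c(wi-1, wi)
unigrams = Counter(chain.from_iterable(sentences))                # c(wi-1)

def p_laplace(prev, word):
    # P(word | prev) = (c(prev, word) + 1) / (c(prev) + |V|)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

print(sum(p_laplace("is", w) for w in vocab))  # 1.0

Here c(prev) is taken as the unigram count of prev, which equals the number of bigrams starting with prev as long as prev never ends a sentence; that holds for "is" in this corpus.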

Upvotes: 0
