curious

Reputation: 211

Understanding the Difference Between Entropy and Cross-Entropy in Language Models: Practical Example with Character-Level Unigram Model

I'm trying to understand the difference between entropy and cross-entropy, as I often hear about the entropy of a language and the cross-entropy of a language model, and I want to understand the link between the two.

To simplify things, let's consider a language (with a vocabulary) and a language model trained on that language.

We'll work at the character level (so a vocabulary of 26 characters) with a limited number of words (the 20 names below).

prenoms = [
    "Alice", "Alfred", "Alina", "Aline", "Alexandre", 
    "Alicia", "Alison", "Alma", "Alva", "Elise", 
    "Elisa", "Eliane", "Alain", "Amélie", "Arline", 
    "Olivier", "Oline", "Alva", "Eliott", "Julien"
]

How do we calculate the entropy over these 20 names (i.e., the entropy of our language) and the cross-entropy of our language model (let's take a unigram model, or any language model you prefer, to help me understand)?

If you have a more relevant example, I’m open to it.

PS: My confusion comes from the fact that, in the general definitions, we talk about a single (language) distribution P when calculating entropy (without quite knowing how to compute it), but about two distributions P and Q when calculating cross-entropy (where, in the case of the cross-entropy loss, P is a one-hot vector).
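To make this PS concrete, here is how I currently picture the two textbook definitions on a toy distribution (just a sketch of my understanding; the distributions P and Q below are made up):

import math

# P is the "true" distribution, Q is the model's distribution (both made up here)
P = {"a": 0.5, "b": 0.25, "c": 0.25}
Q = {"a": 0.4, "b": 0.4, "c": 0.2}

# Entropy of P:   H(P)    = -sum_x P(x) * log2(P(x))
entropy_P = -sum(p * math.log2(p) for p in P.values())

# Cross-entropy:  H(P, Q) = -sum_x P(x) * log2(Q(x))
cross_entropy_PQ = -sum(P[x] * math.log2(Q[x]) for x in P)

print(entropy_P)         # 1.5
print(cross_entropy_PQ)  # about 1.57, always >= H(P)

What I don't see is how to go from these definitions to "the entropy of a language" (where does P come from?) and to "the cross-entropy of a language model" on that language.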

PS2: Some Python code would help me understand this better; here is my current attempt, based on my understanding of the Jurafsky book (and https://huggingface.co/docs/transformers/perplexity):

def distribution_ngrams(text, n=4):
    """
    Return the empirical distribution of the n-grams of a text.
    """
    from collections import Counter

    # Slide a window of size n over the text and count each n-gram
    ngrams = [text[i:i+n] for i in range(len(text)-n+1)]
    counts = Counter(ngrams)
    total = len(ngrams)

    # Relative frequency of each n-gram
    distribution = {ngram: count/total for ngram, count in counts.items()}
    return distribution

def language_entropy_ngrams(text, n_approx=4):
    """
    Estimate the entropy rate of a text from its n-gram distribution
    (in theory we would take a very large n and an infinite sequence L).
    """
    import math
    distribution = distribution_ngrams(text, n_approx)
    # Entropy of the n-gram distribution
    entropy = -sum(p * math.log2(p) for ngram, p in distribution.items())
    entropy_rate = entropy / n_approx  # normalize by the size of the n-gram (bits per character)
    return entropy_rate

def model_cross_entropy(text, n_approx=4):
    """
    Cross-entropy between the n-gram distribution of the text (standing in for P)
    and a unigram model Q, which assigns to each n-gram the product of the
    unigram probabilities of its characters.
    """
    import math
    unigram_model_distribution = distribution_ngrams(text, 1)
    language_model_distribution_approximation = distribution_ngrams(text, n_approx)

    q = {}
    cross_entropy = 0
    for ngram, p in language_model_distribution_approximation.items():
        # Probability of the n-gram under the unigram model:
        # product of the unigram probabilities of its characters
        q[ngram] = 1
        for c in ngram:
            q[ngram] = q[ngram] * unigram_model_distribution[c]
        cross_entropy -= p * math.log2(q[ngram])

    return cross_entropy / n_approx  # per-character cross-entropy

if __name__ == "__main__":
    prenoms = ["Alice", "Alfred", "Alina", "Aline", "Alexandre", "Alicia", 
            "Alison", "Alma", "Alva", "Elise", "Elisa", "Eliane", "Alain", 
            "Amélie", "Arline", "Olivier", "Oline", "Alva", "Eliott", "Julien"] #each prenonm can be seen as a sequence of characters


    L = ''.join(prenoms).lower() #the corpus/language L can be seen as the concatenation of the sequences
    print(language_entropy_ngrams(L))
    print(model_cross_entropy(L))
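For comparison, I also tried the more direct route I see in the Hugging Face perplexity page: average the negative log2-probability that the unigram model assigns to each character of the corpus (again only a sketch of my understanding; the function name unigram_cross_entropy_per_char is mine):

def unigram_cross_entropy_per_char(text):
    """
    Average -log2 Q(c) over the characters c of the text, where Q is the
    empirical unigram (character) distribution estimated on that same text.
    """
    import math
    q = distribution_ngrams(text, 1)
    return -sum(math.log2(q[c]) for c in text) / len(text)

# e.g. print(unigram_cross_entropy_per_char(L)) in the main block

(If I'm not mistaken, because Q is estimated on the same text, this is just the entropy of the character distribution, which is exactly where I lose track of what plays the role of P and what plays the role of Q.)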



Upvotes: 1

Views: 44

Answers (0)
