Reputation: 211
I'm trying to understand the difference between entropy and cross-entropy, as I often hear about the entropy of a language and the cross-entropy of a language model, and I want to understand the link between the two.
To simplify things, let's consider a language (with a vocabulary) and a language model trained on that language.
We'll work at the character level (which gives us 26 characters) and with a limited number of words (let's take the 20 names below).
prenoms = [
"Alice", "Alfred", "Alina", "Aline", "Alexandre",
"Alicia", "Alison", "Alma", "Alva", "Elise",
"Elisa", "Eliane", "Alain", "Amélie", "Arline",
"Olivier", "Oline", "Alva", "Eliott", "Julien"
]
How do we calculate the entropy over these 20 names (i.e., the entropy of our language) and the cross-entropy of our language model (let's take a unigram model, or any other model you prefer, to help me understand)?
If you have a more relevant example, I’m open to it.
PS: My confusion comes from the fact that general definitions talk about a single (language) distribution P when computing entropy (without it being clear how to compute it), but about two distributions P and Q when computing cross-entropy (where, for the cross-entropy loss, P is a one-hot vector).
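For reference, the standard definitions I have in mind are

$$H(P) = -\sum_x P(x)\,\log_2 P(x), \qquad H(P, Q) = -\sum_x P(x)\,\log_2 Q(x),$$

and my question is essentially what P and Q should be when we talk about "the entropy of a language" versus "the cross-entropy of a language model".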
PS2: Some Python code could help me understand better; here is my attempt, based on my understanding of the Jurafsky book (and https://huggingface.co/docs/transformers/perplexity).
def distribution_ngrams(text, n=4):
    """
    Empirical distribution of the character n-grams of length n in text.
    """
    from collections import Counter
    ngrams = [text[i:i+n] for i in range(len(text) - n + 1)]
    counts = Counter(ngrams)
    total = len(ngrams)
    # Relative frequency of each n-gram
    distribution = {ngram: count / total for ngram, count in counts.items()}
    return distribution
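As a quick sanity check of my own for what this function returns, on a toy string:

# Toy example (mine): bigram distribution of "alice"
print(distribution_ngrams("alice", 2))
# {'al': 0.25, 'li': 0.25, 'ic': 0.25, 'ce': 0.25}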
def language_entropy_ngrams(text, n_approx=4):
    """
    Estimate the entropy of a text using n-grams (normally we would take a very
    large n and consider an infinite sequence L).
    """
    import math
    distribution = distribution_ngrams(text, n_approx)
    # Shannon entropy of the n-gram distribution, in bits
    entropy = -sum(p * math.log2(p) for ngram, p in distribution.items())
    entropy_rate = entropy / n_approx  # normalize by the n-gram length to get bits per character
    return entropy_rate
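And a small check I added to convince myself the estimate behaves sensibly: for a string made of two equally frequent characters, the unigram entropy should be exactly 1 bit per character.

# Toy check (mine): two equiprobable characters -> 1 bit per character
print(language_entropy_ngrams("abababab", n_approx=1))  # expected: 1.0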
def model_cross_entropy(text, n_approx=4):
    """
    Cross-entropy between the empirical n-gram distribution of the text (P)
    and a unigram model (Q), in bits per character.
    """
    import math
    unigram_model_distribution = distribution_ngrams(text, 1)
    language_model_distribution_approximation = distribution_ngrams(text, n_approx)
    cross_entropy = 0.0
    for ngram, p in language_model_distribution_approximation.items():
        # Probability the unigram model assigns to this n-gram:
        # the product of its character probabilities
        q = 1.0
        for c in ngram:
            q *= unigram_model_distribution[c]
        cross_entropy -= p * math.log2(q)
    return cross_entropy / n_approx
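To connect this with the one-hot view mentioned in my PS (and in the Hugging Face perplexity page), here is a sketch of how I assume the same unigram cross-entropy could be computed as an average negative log-probability per character; the helper name unigram_nll_per_char is mine, and it reuses distribution_ngrams from above:

def unigram_nll_per_char(text):
    """
    Average of -log2 Q(c) over the characters c of text, where Q is the
    unigram model estimated on the text. Each position contributes a
    'one-hot' P that puts all its mass on the character that actually occurs.
    """
    import math
    q = distribution_ngrams(text, 1)
    return -sum(math.log2(q[c]) for c in text) / len(text)

If I understand correctly, this should coincide with the unigram entropy of the text (since Q is estimated on the same text it is evaluated on), and that is exactly the P-vs-Q distinction I am unsure about.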
if __name__ == "__main__":
prenoms = ["Alice", "Alfred", "Alina", "Aline", "Alexandre", "Alicia",
"Alison", "Alma", "Alva", "Elise", "Elisa", "Eliane", "Alain",
"Amélie", "Arline", "Olivier", "Oline", "Alva", "Eliott", "Julien"] #each prenonm can be seen as a sequence of characters
L = ''.join(prenoms).lower() #the corpus/language L can be seen as the concatenation of the sequences
print(language_entropy_ngrams(L))
print(model_cross_entropy(L))
Upvotes: 1
Views: 44