Reputation: 111
How to find out the entropy of the English language by using isolated symbol probabilities of the language?
Upvotes: 10
Views: 10859
Reputation: 69997
If we define 'isolated symbol probabilities' in the way it's done in this SO answer, we would have to do the following:
Obtain a representative sample of English text (perhaps a carefully selected corpus of news articles, blog posts, some scientific articles and some personal letters), as large as possible
Iterate through its characters and count the frequency of occurrence of each of them
Use the frequency, divided by the total number of characters, as an estimate of each character's probability
Calculate each character's contribution to the entropy by multiplying its probability by the negative logarithm of that same probability (the base-2 logarithm if we want the unit of entropy to be the bit)
Take the sum of these contributions over all characters. That sum is the entropy estimate.
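In symbols, these steps compute the standard Shannon entropy of the estimated character distribution. Writing $c_i$ for the observed count of character $i$, $N$ for the total number of characters, and $p_i = c_i / N$ for the estimated probability, the result is

$$H = -\sum_{i} p_i \log_2 p_i \;\approx\; -\sum_{i} \frac{c_i}{N} \log_2 \frac{c_i}{N}$$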
Caveats:
This isolated-symbols entropy is not what is usually referred to as Shannon's entropy estimate for English. Shannon based his estimate on conditional n-gram probabilities rather than on isolated symbols, and his famous 1951 paper is largely about how the estimate depends on the choice of n. (A minimal sketch of a conditional bigram estimate is given after the code example below.)
Most people who try to estimate the entropy of English exclude punctuation characters and normalise all text to lowercase.
The above assumes that a symbol is defined as a character (or letter) of English. You could do a similar thing for entire words, or other units of text.
Code example:
Here is some Python code that implements the procedure described above. It normalises the text to lowercase and excludes punctuation and any other non-alphabetic, non-whitespace character. It assumes that you have put together a representative corpus of English and provide it (encoded as ASCII) on STDIN.
import re
import sys
from math import log

# Function to compute the base-2 logarithm of a floating point number.
def log2(number):
    return log(number) / log(2)

# Function to normalise the text.
cleaner = re.compile('[^a-z]+')
def clean(text):
    return cleaner.sub(' ', text)

# Dictionary for letter counts
letter_frequency = {}

# Read and normalise input text
text = clean(sys.stdin.read().lower().strip())

# Count letter frequencies
for letter in text:
    if letter in letter_frequency:
        letter_frequency[letter] += 1
    else:
        letter_frequency[letter] = 1

# Calculate entropy
length_sum = 0.0
for letter in letter_frequency:
    probability = float(letter_frequency[letter]) / len(text)
    length_sum += probability * log2(probability)

# Output
sys.stdout.write('Entropy: %f bits per character\n' % (-length_sum))
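Assuming you save the script as entropy.py and your corpus as corpus.txt (both names are just placeholders), you could run it like this:

python entropy.py < corpus.txt

To illustrate the conditional n-gram idea mentioned in the caveats, here is a minimal sketch (not Shannon's actual procedure; the function name and the choice of n = 2, i.e. bigrams, are my own for this example) that estimates the conditional entropy H(X_n | X_{n-1}) from the same kind of normalised text:

import re
import sys
from collections import Counter
from math import log

def log2(number):
    return log(number) / log(2)

# Same normalisation as above: lowercase, keep only letters and single spaces.
cleaner = re.compile('[^a-z]+')

def conditional_bigram_entropy(raw_text):
    text = cleaner.sub(' ', raw_text.lower().strip())
    # Count bigrams (adjacent character pairs) and the characters that start them.
    bigram_counts = Counter(zip(text, text[1:]))
    first_char_counts = Counter(text[:-1])
    total_bigrams = float(len(text) - 1)
    entropy = 0.0
    for (first, second), count in bigram_counts.items():
        joint_probability = count / total_bigrams                          # estimate of P(first, second)
        conditional_probability = count / float(first_char_counts[first])  # estimate of P(second | first)
        entropy -= joint_probability * log2(conditional_probability)
    return entropy

if __name__ == '__main__':
    sys.stdout.write('Conditional bigram entropy: %f bits per character\n'
                     % conditional_bigram_entropy(sys.stdin.read()))

The result will generally be lower than the isolated-symbol figure, because knowing the previous character reduces the uncertainty about the next one.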
Upvotes: 18