Reputation: 91
I know the Shannon entropy of English is estimated at 1.0 to 1.5 bits per letter, and some estimates go as low as 0.6 to 1.3 bits per letter, but I was wondering: is there a way to run an algorithm that looks at a large volume of text and then determines that the entropy of the collective text is, say, 0.8 bits per letter?
Upvotes: 9
Views: 4627
Reputation: 12937
The mathematical definition of the entropy rate of a language is this: given a source that generates strings in that language (and assuming the source is stationary), it is the limit, as n grows, of the entropy of the nth symbol conditioned on the n-1 previous ones.
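In symbols, if X_1, X_2, ... denote the successive symbols emitted by the source, this limit is

```latex
H = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \dots, X_1)
```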
A good enough approximation of such a source is a large corpus of English text. The Open American National Corpus is pretty nice (100M characters, covering all types of written text). The basic algorithm to approximate the limit above is then to look, for a given n, at all n-grams that appear in the text, and build statistical estimates of the probabilities that appear in the conditional entropies used to compute the entropy rate.
The full source code to do it is short and simple (~40 lines of Python). I recently wrote a blog post about estimating the entropy rate of English that goes into much more detail, including the mathematical definitions and a full implementation. It also includes references to various relevant papers, including Shannon's original article.
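Not the code from the blog post, but a minimal sketch of the n-gram approach described above (the corpus file name is just a placeholder):

```python
from collections import Counter
from math import log2

def conditional_entropy(text, n):
    """Estimate H(X_n | X_1, ..., X_{n-1}) from n-gram counts in `text`."""
    positions = range(len(text) - n + 1)
    ngrams = Counter(text[i:i + n] for i in positions)
    prefixes = Counter(text[i:i + n - 1] for i in positions)
    total = sum(ngrams.values())
    h = 0.0
    for gram, count in ngrams.items():
        p_gram = count / total                # P(prefix + last symbol)
        p_cond = count / prefixes[gram[:-1]]  # P(last symbol | prefix)
        h -= p_gram * log2(p_cond)
    return h

# The estimates should decrease toward the entropy rate as n grows,
# until the corpus is too small to estimate the n-gram probabilities reliably.
with open("corpus.txt", encoding="utf-8") as f:   # placeholder file name
    text = f.read().lower()

for n in range(1, 6):
    print(n, conditional_entropy(text, n))
```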
Upvotes: 6
Reputation: 31533
Oli Charlesworth is correct: entropy is defined on probability distributions, not on text.
The only true way to generate a measure of disorder for data is to use Kolmogorov complexity. This has problems too: in particular, it is uncomputable, and it is not strictly well defined, since one must arbitrarily pick a base language (as Oli puts it, the "context"). The well-definedness issue can be resolved if the disorder you are measuring is relative to something that is going to process the data. So when considering compression on a particular computer, the base language would be the assembly language of that computer.
So you could define the disorder of text as follows:
The length of the shortest program, written in that assembly language, that outputs the text.
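In the usual notation, with U a fixed universal machine (here, the chosen computer and its assembly language) and |p| the length of a program p, that definition reads:

```latex
K_U(x) = \min \{\, |p| : U(p) = x \,\}
```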
Upvotes: 0
Reputation: 171206
The Shannon entropy of text can only be estimated; the exact value cannot be determined. You can estimate it by running efficient compression algorithms over the text (e.g. PAQ) or by having humans predict the next letter of a given string. Humans do a good job because they apply semantic knowledge, not just statistical or syntactic knowledge.
Short answer: try to compress the data/text you have as well as possible and calculate how many bits you empirically needed.
How far you can get the number down depends on the concrete algorithm. The result is always just an upper bound on the Shannon entropy (remember that the exact value will never be known).
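For instance, here is a minimal sketch of that upper-bound estimate using Python's built-in lzma module (the file name is a placeholder, and a stronger compressor such as PAQ would give a tighter bound):

```python
import lzma

# Read the text whose entropy we want to bound (placeholder file name).
with open("corpus.txt", "rb") as f:
    data = f.read()

# The compressed size is an empirical upper bound on the information content.
compressed = lzma.compress(data, preset=9)

# Assuming roughly one byte per character (plain ASCII English text).
bits_per_char = 8 * len(compressed) / len(data)
print(f"Upper bound on the entropy: {bits_per_char:.3f} bits per character")
```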
Upvotes: 2