Reputation: 26932
Is there a standard way to do this?
Googling -- "approximate entropy" bits -- uncovers multiple academic papers but I'd like to just find a chunk of pseudocode defining the approximate entropy for a given bit string of arbitrary length.
(In case this is easier said than done and it depends on the application, my application involves 16,320 bits of encrypted data (cyphertext). But encrypted as a puzzle and not meant to be impossible to crack. I thought I'd first check the entropy but couldn't easily find a good definition of such. So it seemed like a question that ought to be on StackOverflow! Ideas for where to begin with de-cyphering 16k random-seeming bits are also welcome...)
See also this related question:
What is the computer science definition of entropy?
Upvotes: 60
Views: 62180
Reputation: 891
Expanding on fmark's answer, for the specific task of "see how random 16,320 bits of data are":
You likely want to look at more than single-byte frequencies to find patterns in your data, so you can chunk it into multi-byte symbols and then analyze the distribution of those. There's probably a more advanced solution based on the entropy of prefix -> continuation, but:
import math

def get_entropy_chunk_length(binary_corpus: bytes, chunk_size=1):
    # Split 'abcd' into ['ab', 'bc', 'cd']
    chunks = [binary_corpus[i:i+chunk_size] for i in range(len(binary_corpus) - chunk_size)]
    # Count the number of times each unique chunk occurs.
    occurrences = {chunk: binary_corpus.count(chunk) for chunk in set(chunks)}
    # Divide by the number of chunks to get frequencies.
    frequencies = {chunk: count / len(chunks) for chunk, count in occurrences.items()}
    # Create a list we can use to look at the most frequent chunks.
    sorted_occurrences = list(occurrences.items())
    sorted_occurrences.sort(key=lambda x: x[1], reverse=True)
    # Compute the entropy as if we randomly generated chunks from the given distribution.
    entropy_per_symbol = -sum([freq * math.log2(freq) for freq in frequencies.values()])
    entropy_per_byte = entropy_per_symbol / chunk_size
    return entropy_per_byte, sorted_occurrences

def get_entropy_data(binary_corpus: bytes, chunk_size=1):
    entropy_per_byte, occurrences = get_entropy_chunk_length(binary_corpus, chunk_size)
    unused_substring_ratio = (8 * chunk_size) - math.log2(len(occurrences))
    print(f"Chunk-size {chunk_size}:\t{entropy_per_byte:.2f} bits of entropy/byte")
    print(f"\tFraction of substrings used: 1 in 2^{unused_substring_ratio:.1f} ({2 ** unused_substring_ratio:.1e})")
    print(f"\tMost frequent substrings:")
    for rank in range(3):
        print(f'\t\t#{rank} ({occurrences[rank][1]}):\t{repr(occurrences[rank][0])}')

with open('myfile.data', 'rb') as f:
    data_str = f.read()

for i in [1, 2, 4, 32]:
    get_entropy_data(data_str, i)
When we run this on a random .zip file I have:
Chunk-size 1:   7.18 bits of entropy/byte
    Fraction of substrings used: 1 in 2^0.0 (1.0e+00)
    Most frequent substrings:
        #0 (1835):  b'\x00'
        #1 (256):   b'\x08'
        #2 (174):   b'o'
Chunk-size 2:   5.51 bits of entropy/byte
    Fraction of substrings used: 1 in 2^3.2 (9.2e+00)
    Most frequent substrings:
        #0 (744):   b'\x00\x00'
        #1 (106):   b'PK'
        #2 (93):    b'\x14\x00'
Chunk-size 4:   2.96 bits of entropy/byte
    Fraction of substrings used: 1 in 2^19.0 (5.1e+05)
    Most frequent substrings:
        #0 (258):   b'\x00\x00\x00\x00'
        #1 (58):    b'\x00\x08\x08\x08'
        #2 (58):    b'\x14\x00\x08\x08'
Chunk-size 32:  0.42 bits of entropy/byte
    Fraction of substrings used: 1 in 2^242.4 (9.5e+72)
    Most frequent substrings:
        #0 (6): ...
We can see an issue with chunking into blocks of 4 bytes: the file doesn't contain most possible 4-byte strings. A string occurring once isn't signal the way '\x00\x00\x00\x00' occurring 258 times is; it's just a reflection of having a relatively small sample of the overall random distribution. So in essence, at chunk size 32 it's not really estimating entropy, it's estimating how hard it would be to hard-code every 32-byte sequence that happens to occur in this file.
We can fix this with a slightly more informed model of the channel: estimate a "probability of informative data" by saying that every chunk occurring at most once is probably just uniformly distributed random data. So our new theoretical channel is:

- with probability p_informative, generate data sampled from the repeated chunks;
- with probability 1 - p_informative, generate uniformly random bytes.

We can break those two cases down, get each entropy, and sum them up:
def get_entropy_chunk_length(binary_corpus: bytes, chunk_size=1):
    # Split 'abcd' into ['ab', 'bc', 'cd']
    chunks = [binary_corpus[i:i+chunk_size] for i in range(len(binary_corpus) - chunk_size)]
    # Count the number of times each unique chunk occurs.
    occurrences = {chunk: binary_corpus.count(chunk) for chunk in set(chunks)}
    sum_of_all_counts = sum([count for count in occurrences.values()])
    # Only keep chunks that happen at least twice.
    repeat_occurrences = {chunk: count for chunk, count in occurrences.items() if count > 1}
    # Compute the frequency of each outcome, conditioned on it being a repeated outcome.
    sum_of_repeat_counts = sum([count for count in repeat_occurrences.values()])
    conditional_frequencies = {chunk: count / sum_of_repeat_counts for chunk, count in repeat_occurrences.items()}
    conditional_entropy_repeated = -sum([freq * math.log2(freq) for freq in conditional_frequencies.values()])
    # We want to pretend that every chunk occurring only once is just uniformly random data.
    # We split this into two cases, an "informative" outcome and an "uninformative" outcome.
    # In the uninformative outcome, the chunk is uniform across all possible chunks.
    # In the informative outcome, it's distributed according to `repeat_occurrences`.
    # The total entropy is the weighted sum of the two sub-entropies, plus the entropy of the
    # initial choice between informative and uninformative.
    # We use the fraction of counts belonging to chunks that occurred at least twice as a proxy
    # for how much of the data is repeated vs random.
    p_informative = sum_of_repeat_counts / sum_of_all_counts
    if p_informative == 0 or p_informative == 1:
        initial_choice_entropy = 0
    else:
        initial_choice_entropy = -(
            p_informative * math.log2(p_informative)
            + (1 - p_informative) * math.log2(1 - p_informative)
        )
    uninformative_entropy = 8 * chunk_size
    informative_entropy = conditional_entropy_repeated
    # Create a list we can use to look at the most frequent chunks.
    sorted_occurrences = list(repeat_occurrences.items())
    sorted_occurrences.sort(key=lambda x: x[1], reverse=True)
    total_entropy = initial_choice_entropy + (1 - p_informative) * uninformative_entropy + p_informative * informative_entropy
    entropy_per_byte = total_entropy / chunk_size
    return (
        entropy_per_byte,
        sorted_occurrences,
        {
            'initial': initial_choice_entropy,
            'uninformed': uninformative_entropy,
            'informed': informative_entropy,
            'p_informative': p_informative,
        }
    )
And that gives much more reasonable results. You can see that the entropy stops decreasing as the chunks get larger, since less and less data of that size gets repeated.
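The breakdowns below were printed with an updated get_entropy_data along these lines (a sketch; it just formats the extra dict returned above, and the exact formatting is a guess):

def get_entropy_data(binary_corpus: bytes, chunk_size=1):
    entropy_per_byte, occurrences, parts = get_entropy_chunk_length(binary_corpus, chunk_size)
    p = parts['p_informative']
    print(f"Chunk-size {chunk_size}:\t{entropy_per_byte:.2f} bits of entropy/byte")
    print(f"\tinitial: {parts['initial']:.2f} bits")
    print(f"\tuninformed ({100 * (1 - p):.1f}%): {parts['uninformed']:.1f} bits ({(1 - p) * parts['uninformed']:.1f} bits)")
    print(f"\tinformed ({100 * p:.1f}%): {parts['informed']:.1f} bits ({p * parts['informed']:.1f} bits)")
    print(f"\tMost frequent substrings:")
    for rank in range(min(3, len(occurrences))):
        print(f'\t\t#{rank} ({occurrences[rank][1]}):\t{repr(occurrences[rank][0])}')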
Chunk-size 1:   7.18 bits of entropy/byte
    initial: 0 bits
    uninformed (0.0%): 8.0 bits (0.0 bits)
    informed (100.0%): 7.2 bits (7.2 bits)
    Most frequent substrings:
        #0 (1835):  b'\x00'
        #1 (256):   b'\x08'
        #2 (174):   b'o'
Chunk-size 2:   6.67 bits of entropy/byte
    initial: 1.00 bits
    uninformed (51.8%): 16.0 bits (8.3 bits)
    informed (48.2%): 8.4 bits (4.0 bits)
    Most frequent substrings:
        #0 (744):   b'\x00\x00'
        #1 (106):   b'PK'
        #2 (93):    b'\x14\x00'
Chunk-size 8:   6.13 bits of entropy/byte
    initial: 0.86 bits
    uninformed (71.1%): 64.0 bits (45.5 bits)
    informed (28.9%): 9.2 bits (2.7 bits)
    Most frequent substrings:
        #0 (86):    b'\x00\x00\x00\x00\x00\x00\x00\x00'
        #1 (47):    b'Y\x00\x00\x00\x00\x00\x00\x00'
        #2 (46):    b'\x00\x00option'
Chunk-size 24:  7.52 bits of entropy/byte
    initial: 0.34 bits
    uninformed (93.6%): 192.0 bits (179.7 bits)
    informed (6.4%): 7.9 bits (0.5 bits)
    Most frequent substrings:
        ...
Upvotes: 0
Reputation: 8608
Here's an implementation in Python (I also added it to the Wiki page):
import numpy as np

def ApEn(U, m, r):
    """Approximate entropy of the sequence U for pattern length m and tolerance r."""

    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)
    return _phi(m) - _phi(m + 1)
Example:
>>> U = np.array([85, 80, 89] * 17)
>>> ApEn(U, 2, 3)
1.0996541105257052e-05
The above example is consistent with the example given on Wikipedia.
Upvotes: 1
Reputation: 746
Using the Shannon entropy of a word, H = -sum(p_i * log2(p_i)) over the distinct characters i with frequencies p_i (formula: https://i.sstatic.net/GBBJG.jpg).
Here's an O(n) algorithm that calculates it:
import math
from collections import Counter

def entropy(s):
    l = float(len(s))
    return -sum(map(lambda a: (a/l)*math.log2(a/l), Counter(s).values()))
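For example, values that follow directly from the formula: a word with four equally likely characters carries 2 bits per character, and one with two equally likely characters carries 1 bit:

>>> entropy('abcd')
2.0
>>> entropy('aabb')
1.0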
Upvotes: 3
Reputation: 58547
Shannon's entropy equation is the standard method of calculation. Here is a simple implementation in Python, shamelessly copied from the Revelation codebase, and thus GPL licensed:
import math

def entropy(string):
    "Calculates the Shannon entropy of a string"

    # get probability of chars in string
    prob = [ float(string.count(c)) / len(string) for c in dict.fromkeys(list(string)) ]

    # calculate the entropy
    entropy = - sum([ p * math.log(p) / math.log(2.0) for p in prob ])

    return entropy


def entropy_ideal(length):
    "Calculates the ideal Shannon entropy of a string with given length"

    prob = 1.0 / length

    return -1.0 * length * prob * math.log(prob) / math.log(2.0)
Note that this implementation assumes that your input bit-stream is best represented as bytes. This may or may not be the case for your problem domain. What you really want is your bitstream converted into a string of numbers. Just how you decide on what those numbers are is domain specific. If your numbers really are just ones and zeros, then convert your bitstream into an array of ones and zeros. The conversion method you choose will affect the results you get, however.
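For instance, one way to apply the function above to the 16k bits from the question (a sketch; 'ciphertext.bin' is a placeholder filename) is to expand each byte into '0'/'1' characters, so the alphabet has two symbols and the maximum possible entropy is 1 bit per symbol:

with open('ciphertext.bin', 'rb') as f:   # placeholder filename
    data = f.read()

# Treat the data as a string of '0'/'1' characters, one per bit.
bit_string = ''.join(format(byte, '08b') for byte in data)
print(entropy(bit_string))   # at most 1.0; near 1.0 means the ones and zeros look uniform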
Upvotes: 38
Reputation: 4107
The NIST Random Number Generator evaluation toolkit has a way of calculating "Approximate Entropy." Here's the short description:
Approximate Entropy Test Description: The focus of this test is the frequency of each and every overlapping m-bit pattern. The purpose of the test is to compare the frequency of overlapping blocks of two consecutive/adjacent lengths (m and m+1) against the expected result for a random sequence.
And a more thorough explanation is available from the PDF on this page:
http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html
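For reference, the core of that test is small enough to sketch directly. This is only the ApEn(m) = phi(m) - phi(m+1) part described above; the full SP 800-22 test also turns it into a chi-square statistic and a P-value. The input is assumed to be a Python string of '0'/'1' characters, and the function names are mine:

import math
from collections import Counter

def nist_phi(bits, m):
    n = len(bits)
    # Wrap around by appending the first m-1 bits, then count all n overlapping m-bit blocks.
    padded = bits + bits[:m - 1]
    counts = Counter(padded[i:i + m] for i in range(n))
    return sum((c / n) * math.log(c / n) for c in counts.values())

def nist_apen(bits, m=2):
    # Near log(2) for random-looking bits, near 0 for very regular ones.
    return nist_phi(bits, m) - nist_phi(bits, m + 1)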
Upvotes: 11
Reputation: 74382
Entropy is not a property of the string you got, but of the strings you could have obtained instead. In other words, it qualifies the process by which the string was generated.
In the simple case, you get one string among a set of N possible strings, where each string has the same probability of being chosen as every other, i.e. 1/N. In that situation, the string is said to have an entropy of N. The entropy is often expressed in bits, which is a logarithmic scale: an entropy of "n bits" is an entropy equal to 2^n.
For instance: I like to generate my passwords as two lowercase letters, then two digits, then two lowercase letters, and finally two digits (e.g. va85mw24). Letters and digits are chosen randomly, uniformly, and independently of each other. This process may produce 26*26*10*10*26*26*10*10 = 4569760000 distinct passwords, and all these passwords have equal chances to be selected. The entropy of such a password is then 4569760000, which means about 32.1 bits.
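In code, that last step is just a logarithm (a quick check of the arithmetic above):

import math

n_passwords = 26**4 * 10**4     # 4,569,760,000 equally likely passwords
print(math.log2(n_passwords))   # ~32.09 bits of entropy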
Upvotes: 43
Reputation: 81
There is no single answer. Entropy is always relative to some model. When someone talks about a password having limited entropy, they mean "relative to the ability of an intelligent attacker to predict", and it's always an upper bound.
Your problem is, you're trying to measure entropy in order to help you find a model, and that's impossible; what an entropy measurement can tell you is how good a model is.
Having said that, there are some fairly generic models that you can try; they're called compression algorithms. If gzip can compress your data well, you have found at least one model that can predict it well. And gzip is, for example, mostly insensitive to simple substitution: it can handle "wkh" occurring frequently in the text as easily as it can handle "the".
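As a rough sketch of that idea (zlib uses the same DEFLATE scheme as gzip; 'ciphertext.bin' is a placeholder filename), the compressed size gives an upper bound on the bits of entropy per byte under that model:

import zlib

with open('ciphertext.bin', 'rb') as f:   # placeholder filename
    data = f.read()

compressed = zlib.compress(data, 9)
print(f"~{8 * len(compressed) / len(data):.2f} bits per byte under the DEFLATE model")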
Upvotes: 8
Reputation: 26932
I believe the answer is the Kolmogorov Complexity of the string. Not only is this not answerable with a chunk of pseudocode, Kolmogorov complexity is not a computable function!
One thing you can do in practice is compress the bit string with the best available data compression algorithm. The more it compresses, the lower the entropy.
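In practice, "best available" just means trying several compressors and keeping the smallest output as an upper bound on the entropy; a minimal sketch using only the standard library ('ciphertext.bin' is a placeholder filename):

import bz2, lzma, zlib

with open('ciphertext.bin', 'rb') as f:   # placeholder filename
    data = f.read()

# Whichever compressor does best gives the tightest upper bound.
smallest = min(len(zlib.compress(data, 9)),
               len(bz2.compress(data, 9)),
               len(lzma.compress(data)))
print(f"at most ~{8 * smallest / len(data):.2f} bits of entropy per byte")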
Upvotes: 21