HugoRune

Reputation: 13789

Calculate a threshold so that 99% of random strings have higher entropy

I am trying to calculate the distribution of the Shannon entropy over all possible random strings of length N. (By random, I mean that each letter at each position is picked with equal probability.)

To calculate the Shannon entropy, I am using a formula similar to the one in, e.g., https://stackoverflow.com/a/2979208/145999.

prob = [ float(string.count(c)) / len(string) for c in dict.fromkeys(list(string)) ]
entropy = - sum([ p * math.log(p) / math.log(2.0) for p in prob ])
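
For example, with the missing import added and a concrete input, the same two lines run like this (the sample string here is just for illustration):

import math

string = "hello"  # any input string
prob = [float(string.count(c)) / len(string) for c in dict.fromkeys(list(string))]
entropy = - sum([p * math.log(p) / math.log(2.0) for p in prob])
print(entropy)  # roughly 1.922 bits for "hello"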

Now I am trying to find a threshold t such that, e.g., 99% of all random strings will have an entropy higher than t.

How can I calculate this threshold entropy, given the length of the string, the number of possible different letters, and the desired fraction of strings that should have higher entropy?

Upvotes: 3

Views: 103

Answers (2)

Dillon Davis

Reputation: 7740

I agree with the others that Shannon entropy is probably not what you want for your particular application; however, I still wanted to offer a solution (a) for your own hands-on experimentation and (b) for anyone else who stumbles upon this question and does need the calculation for other purposes.

Note that we can rewrite the equation for Shannon entropy so that it deals only in (large) integers, deferring all floating-point arithmetic to the very end.

H = -sum_i (c_i / N) * log2(c_i / N) = log2(N) - (1/N) * log2( prod_i c_i^c_i )

where c_i is the number of occurrences of the i-th distinct character and N is the string length.

We will use the right-hand side only at the very end, to convert back to actual entropy values; in our dynamic programming table we'll track just the integer product prod_i c_i^c_i.
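
As a quick sanity check (not part of the original derivation; the helper names below are just for illustration), the integer product and the usual floating-point formula agree for any sample string:

from collections import Counter
from math import log2

def entropy_via_product(s):
    # integer "adjusted entropy": product of count**count over distinct characters
    product = 1
    for c in Counter(s).values():
        product *= c ** c
    return log2(len(s)) - log2(product) / len(s)

def entropy_direct(s):
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

s = "mississippi"  # arbitrary example
print(entropy_via_product(s), entropy_direct(s))  # both ~1.823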

The basic premise is that we can iterate over all string lengths up to full_length and character set sizes up to full_alpha_size, and incrementally build up a table of (adjusted) entropy values and their respective counts (using a tiny bit of combinatorics). Once this table is built, we iterate over the entropy values for full_length and full_alpha_size, convert them to their corresponding Shannon entropy values (now ordinary floats), and yield them along with their corresponding counts.

We can then use this function to generate all possible entropy values and the corresponding counts of distinct strings having each value, accumulate the counts in increasing order of entropy, and return the first entropy whose cumulative count reaches our target percentile.

Code:

from collections import defaultdict
from math import ceil, comb, log2

def count_entropy_values(full_length, full_alpha_size):
    # table[length][alpha_size] maps the integer "adjusted entropy"
    # (the product of count**count over the characters used) to the
    # number of strings of that length/alphabet which produce it
    table = [[defaultdict(int) for _ in range(full_alpha_size + 1)]
                               for _ in range(full_length + 1)]

    # base case: the empty string has an empty product of 1
    for alpha_size in range(full_alpha_size + 1):
        table[0][alpha_size][1] = 1

    for length in range(1, full_length + 1):
        for alpha_size in range(1, full_alpha_size + 1):
            # choose how many times the alpha_size-th character appears and in
            # which of the `length` positions (comb), then recurse on the rest
            for repeat_count in range(length + 1):
                for entropy, table_count in table[length-repeat_count][alpha_size-1].items():
                    new_entropy = entropy * repeat_count ** repeat_count
                    table[length][alpha_size][new_entropy] += table_count * comb(length, repeat_count)

    # convert the integer products back into Shannon entropy values
    for entropy, table_count in table[full_length][full_alpha_size].items():
        real_entropy = log2(full_length) - log2(entropy) / full_length
        yield real_entropy, table_count

def calculate_entropy_threshold(length, alpha_size, percentile):
    table = sorted(count_entropy_values(length, alpha_size))

    # number of strings making up the lowest `percentile` fraction
    target = ceil(alpha_size ** length * percentile)
    total_count = 0

    for entropy, count in table:
        total_count += count
        if total_count >= target:
            return entropy

For the same target (strings of length 10 with 26 characters to choose from), and seeking the 99th-percentile entropy:

>>> calculate_entropy_threshold(10, 26, 0.99)
3.321928094887362

Note that this is the same value obtained empirically in Kaia's answer.

This takes about a millisecond on my machine. Larger values will inevitably be slower; however, for length 30 and an alphabet of 100 it still took only a few seconds, so hopefully that's fast enough for most practical usage.

If you want to query for multiple different lengths and character set sizes, you can edit the code to reuse the dynamic programming table to avoid recomputing that every time (which is the most expensive part).
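
One way to do that, as a rough sketch (the names build_entropy_table and threshold_from_table are hypothetical, not part of the code above), is to build the table once up to the largest length and alphabet you expect to query, and then read any smaller (length, alpha_size) cell directly, since every intermediate cell is itself a complete sub-answer:

from collections import defaultdict
from math import ceil, comb, log2

def build_entropy_table(max_length, max_alpha_size):
    # same dynamic programming recurrence as above, built once
    table = [[defaultdict(int) for _ in range(max_alpha_size + 1)]
                               for _ in range(max_length + 1)]
    for alpha_size in range(max_alpha_size + 1):
        table[0][alpha_size][1] = 1
    for length in range(1, max_length + 1):
        for alpha_size in range(1, max_alpha_size + 1):
            for repeat_count in range(length + 1):
                for product, count in table[length - repeat_count][alpha_size - 1].items():
                    new_product = product * repeat_count ** repeat_count
                    table[length][alpha_size][new_product] += count * comb(length, repeat_count)
    return table

def threshold_from_table(table, length, alpha_size, percentile):
    # any (length, alpha_size) within the table bounds can be queried directly
    entries = sorted((log2(length) - log2(product) / length, count)
                     for product, count in table[length][alpha_size].items())
    target = ceil(alpha_size ** length * percentile)
    total = 0
    for entropy, count in entries:
        total += count
        if total >= target:
            return entropy

table = build_entropy_table(30, 100)                # built once (a few seconds)
print(threshold_from_table(table, 10, 26, 0.99))    # 3.321928094887362, as above
print(threshold_from_table(table, 30, 100, 0.99))   # answered from the same table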

Upvotes: 0

Kaia

Reputation: 891

It seems like the quick-and-dirty method would be to do a Monte Carlo simulation:

from typing import Callable
import math
import random

def get_entropy(string: str) -> float:
    # probability of each distinct character, then the usual Shannon formula
    prob = [string.count(c) / len(string) for c in dict.fromkeys(string)]
    return -sum(p * math.log2(p) for p in prob)

def get_random_string(length=10, chars='abcdefghijklmnopqrstuvwxyz') -> str:
    return ''.join(random.choices(chars, k=length))

def monte_carlo_entropy(random_function: Callable[[], str], num_sims=10000) -> list[float]:
    entropies = []
    for _ in range(num_sims):
        entropies.append(get_entropy(random_function()))
    entropies.sort()
    return entropies

def get_percentile(sorted_entropies: list[float], percentile=0.99) -> float:
    index = int(len(sorted_entropies) * percentile)
    return sorted_entropies[index]

entropies = monte_carlo_entropy(lambda: get_random_string(10))
print(get_percentile(entropies, 0.99))

For that specific length, character set, and uniform distribution, I get 3.321 bits.

It's worth noting that this entropy definition is more than a little bit weird and probably not suited for this application. No matter how long the password is, a lowercase a-z password will have a maximum entropy of -26 * (1/26 * log2(1/26)) ~= 4.700 bits.
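
A quick way to see that ceiling (reusing the get_entropy helper above; the 100,000-character length is arbitrary):

import math
import random
from string import ascii_lowercase

# even a very long lowercase-only password stays at or below log2(26) bits
long_pw = ''.join(random.choices(ascii_lowercase, k=100_000))
print(get_entropy(long_pw))                # close to, but never above, 4.700
print(-26 * (1 / 26 * math.log2(1 / 26)))  # 4.700439718141092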

Upvotes: 1
