user707779

C++ random generator with provided (at least estimated) entropy

Using the C++ standard random facilities I can, more or less efficiently, create sequences with pre-defined distributions using language-provided tools. What about Shannon entropy? Is there some way to define the output Shannon entropy for the generated sequence?

I tried a small experiment: I generated a sufficiently long, uniformly distributed sequence and implemented a Shannon entropy calculator. The resulting value ranges from 0.0 (absolute order) to 8.0 (absolute chaos):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>

template <typename T>
double shannon_entropy(T first, T last)
{
    size_t frequencies_count{};
    double entropy = 0.0;

    std::for_each(first, last, [&entropy, &frequencies_count](auto item) {

        if (item == 0.0) return;            // skip empty bins; 0 * log2(0) is taken as 0
        double fp_item = static_cast<double>(item);
        entropy += fp_item * std::log2(fp_item);
        ++frequencies_count;
    });

    if (frequencies_count > 256) {          // sanity check: at most 256 distinct byte values
        return -1.0;
    }

    return -entropy;
}

std::vector<uint8_t> generate_random_sequence(size_t sequence_size)
{
    std::vector<uint8_t> random_sequence;
    std::random_device rnd_device;

    std::cout << "Random device entropy: " << rnd_device.entropy() << '\n';

    std::mt19937 mersenne_engine(rnd_device());
    std::uniform_int_distribution<unsigned> dist(0, 255);

    auto gen = std::bind(dist, mersenne_engine);
    random_sequence.resize(sequence_size);
    std::generate(random_sequence.begin(), random_sequence.end(), gen);
    return random_sequence; // NRVO applies; no std::move needed
}

std::vector<double> read_random_probabilities(size_t sequence_size)
{
    std::vector<size_t> bytes_distribution(256);
    std::vector<double> bytes_frequencies(256);

    std::vector<uint8_t> random_sequence = generate_random_sequence(sequence_size);

    size_t rnd_seq_size = random_sequence.size();
    std::for_each(random_sequence.begin(), random_sequence.end(), [&](uint8_t b) {
        ++bytes_distribution[b];
    });

    std::transform(bytes_distribution.begin(), bytes_distribution.end(), bytes_frequencies.begin(),
        [&rnd_seq_size](size_t item) {
        return static_cast<double>(item) / rnd_seq_size;
    });
    return bytes_frequencies; // NRVO applies; no std::move needed
}

int main(int argc, char* argv[]) {

    size_t sequence_size = 1024 * 1024;
    std::vector<double> bytes_frequencies = read_random_probabilities(sequence_size);
    double entropy = shannon_entropy(bytes_frequencies.begin(), bytes_frequencies.end());

    std::cout << "Sequence entropy: " << std::setprecision(16) << entropy << std::endl;

    std::cout << "Min possible file size assuming max theoretical compression efficiency:\n";
    std::cout << (entropy * sequence_size) << " in bits\n";
    std::cout << ((entropy * sequence_size) / 8) << " in bytes\n";

    return EXIT_SUCCESS;
}

First, it appears that std::random_device::entropy() is hardcoded to return 32; in MSVC 2015 (which presumably corresponds to 8.0 bits per byte under the Shannon definition). As you can try, this is not far from the truth: in this example the result is always close to 7.9998..., i.e. absolute chaos.

The working example is on IDEONE (by the way, their compiler hardcodes entropy() to 0).

One more thing, the main question: is it possible to create a generator that produces a uniformly distributed sequence with a defined entropy, let's say 6.0 to 7.0? Could it be implemented at all, and if so, are there existing implementations?

Upvotes: 2

Views: 870

Answers (3)

mr.stobbe

Reputation: 612

First, you're viewing Shannon's theory entirely wrong. His argument (as you're using it) is simply: given the probability of x (Pr(x)), the bits required to store x is -log2 Pr(x). It says nothing, by itself, about how "random" x is; in this regard, you're viewing Pr(x) wrong. -log2 Pr(x), given a Pr(x) that is uniformly 1/256, results in a required bit width of 8 bits for storage. However, that's not how statistics work. Go back to thinking about Pr(x), because the bits required mean nothing on their own.
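
For reference, the arithmetic for the uniform byte case is simply:

$$-\log_2 \Pr(x) = -\log_2 \tfrac{1}{256} = 8 \text{ bits}, \qquad H = -\sum_{x=0}^{255} \tfrac{1}{256} \log_2 \tfrac{1}{256} = 8 \text{ bits/symbol}.$$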

Your question is about statistics. Only if the underlying distribution matches the ideal histogram will the observed frequency of each symbol approach its expected probability as the sample size approaches infinity. I want to make it clear that you're not looking for "-log2 Pr(x) is absolute chaos when it's 8, given Pr(x) = 1/256." A uniform distribution is not chaos; in fact, it's... well, uniform. Its properties are well known, simple, and easy to predict. You're looking for "does the finite sample S meet the criteria of an independent uniform distribution (commonly known as 'independent and identically distributed data', or 'i.i.d.') with Pr(x) = 1/256?" This has nothing to do with Shannon's theory; it goes much further back in time, to the basic probability theory involving flips of a coin (in this case binomial, given assumed uniformity).

Assuming for a moment that any C++11 <random> generator meets the criteria for "statistically indistinguishable from i.i.d." (which, by the way, those generators don't), you can use them to emulate i.i.d. results. If you would like a range of data that is storable within 6..7 bits (it wasn't clear whether you meant 6 or 7 bits, but hypothetically everything in between is doable as well), simply scale the range. For example...

#include <iostream>
#include <random>

int main() {
    unsigned long low = 1 << 6; // 2^6 == 64
    unsigned long limit = 1 << 7; // 2^7 == 128
    // Therefore, the range is 6-bits to 7-bits (or 64 + [128 - 64])
    unsigned long range = limit - low;
    std::random_device rd;
    std::mt19937 rng(rd()); // << Doesn't actually meet the criteria for i.i.d.
    std::uniform_int_distribution<unsigned long> dist(low, limit - 1); //<< Given an engine that actually produces i.i.d. data, this would produce exactly what you're looking for
    for (int i = 0; i != 10; ++i) {
        unsigned long y = dist(rng);
        //y is known to be in set {2^6..2^7-1} and assumed to be uniform (coin flip) over {low..low + (range-1)}.
        std::cout << y << std::endl;
    }
    return 0;
}

The problem with this is that, while the <random> distribution classes are accurate, the random number generators (presumably aside from std::random_device, but that's system-specific) are not designed to stand up to statistical tests of fitness as i.i.d. generators.

If you would like one that does, implement a CSPRNG (my go-to is Bob Jenkins' ISAAC) that has an interface meeting the requirements of the <random> class of generators (probably just covering the basic interface of std::random_device is good enough).
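
For illustration, here is a minimal sketch of such an adaptor. csprng_next_u32() is a hypothetical hook for whatever CSPRNG you plug in (it is stubbed out with std::random_device below purely so the sketch compiles); the wrapper only has to satisfy the UniformRandomBitGenerator requirements to work with the <random> distribution classes:

#include <cstdint>
#include <iostream>
#include <limits>
#include <random>

// Hypothetical hook: replace the body with calls into your CSPRNG of choice
// (e.g. an ISAAC implementation). Stubbed with std::random_device here only
// so the sketch is self-contained.
std::uint32_t csprng_next_u32() {
    static std::random_device rd;
    return rd();
}

// Minimal adaptor satisfying the UniformRandomBitGenerator requirements,
// so it can be handed to any <random> distribution class.
struct csprng_engine {
    using result_type = std::uint32_t;
    static constexpr result_type min() { return 0; }
    static constexpr result_type max() { return std::numeric_limits<result_type>::max(); }
    result_type operator()() { return csprng_next_u32(); }
};

int main() {
    csprng_engine eng;
    std::uniform_int_distribution<unsigned> dist(0, 255);
    for (int i = 0; i != 10; ++i) {
        std::cout << dist(eng) << '\n';
    }
    return 0;
}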

Testing for a statistically sound "no" or "we can't say no" on whether a set follows a specific model (and therefore whether Pr(x) is accurate and Shannon's entropy function is an accurate prediction) is another matter entirely. Like I said, no generator in <random> meets these criteria (except maybe std::random_device). My advice is to research things like the central limit theorem, goodness-of-fit tests, birthday spacings, et cetera.
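
To make the goodness-of-fit idea concrete, here is a rough sketch of a chi-squared test of the byte histogram against the uniform model. It is only an illustration, not a substitute for a proper test battery; the 293.25 cut-off is the approximate 5% critical value for 255 degrees of freedom, and the test only addresses the "identically distributed" half, not independence:

#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t n = 1 << 20;   // sample size
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<unsigned> dist(0, 255);

    // Count how often each byte value occurs.
    std::vector<std::size_t> counts(256, 0);
    for (std::size_t i = 0; i != n; ++i) {
        ++counts[dist(rng)];
    }

    // Chi-squared statistic against the uniform expectation n/256.
    const double expected = static_cast<double>(n) / 256.0;
    double chi2 = 0.0;
    for (std::size_t c : counts) {
        const double d = static_cast<double>(c) - expected;
        chi2 += d * d / expected;
    }

    std::cout << "chi^2 = " << chi2 << ' '
              << (chi2 < 293.25 ? "(cannot reject uniformity at ~5%)"
                                : "(reject uniformity at ~5%)") << '\n';
    return 0;
}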

To drive my point a bit more, under the assumptions of your question...

struct uniform_rng {
    unsigned long x;
    constexpr uniform_rng(unsigned long seed = 0) noexcept:
        x{ seed }
    { };

    unsigned long operator ()() noexcept {
        unsigned long y = this->x++;
        return y;
    }
};

... would absolutely meet your criteria of being uniform (or, as you say, "absolute chaos"). Pr(x) is most certainly 1/N, and the bits required to store any number in the set is -log2(1/N), i.e. the bit width of unsigned long. However, it's not independently distributed: because we know its properties, you can "store" its entire sequence simply by storing the seed. Surprise, all PRNGs work this way. Therefore the bits required to store the entire sequence of a PRNG is -log2(1/2^bitsForSeed). As your sample grows, the bits required to store it versus the bits you're able to generate for that sample (i.e. the compression ratio) approaches a limit of 0.

Upvotes: 3

Hans Olsson

Reputation: 12517

It's not clear what you want to achieve, and there are several ways of lowering the Shannon entropy of your sequence:

  • Correlation between the bits, e.g. putting random_sequence through a simple filter (a sketch follows after the code below).
  • Individual bits are not fully random.

As an example of the second point, the code below makes the bytes less random:

std::vector<uint8_t> generate_random_sequence(size_t sequence_size,
                                              uint8_t cutoff = 10)
{
    std::vector<uint8_t> random_sequence;
    std::vector<uint8_t> other_sequence;
    std::random_device rnd_device;

    std::cout << "Random device entropy: " << rnd_device.entropy() << '\n';

    std::mt19937 mersenne_engine(rnd_device());
    std::uniform_int_distribution<unsigned> dist(0, 255);

    auto gen = std::bind(dist, mersenne_engine);
    random_sequence.resize(sequence_size);
    std::generate(random_sequence.begin(), random_sequence.end(), gen);
    other_sequence.resize(sequence_size);
    std::generate(other_sequence.begin(), other_sequence.end(), gen);
    for (size_t j = 0; j < sequence_size; ++j) {
      if (other_sequence[j] <= cutoff) random_sequence[j] = 0; // Or j or ...
    }
    return random_sequence;
}
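
For the first bullet, here is a minimal sketch (my own illustration, not part of the code above) of a trivial smoothing filter that correlates neighbouring bytes. Consecutive outputs now share information, so the true per-byte entropy drops; since the histogram also becomes roughly triangular, even the simple per-byte estimator from the question reports a value below 8:

std::vector<uint8_t> correlate_bytes(const std::vector<uint8_t>& in)
{
    std::vector<uint8_t> out(in.size());
    uint8_t prev = 0;
    for (size_t j = 0; j < in.size(); ++j) {
        // Average each byte with its predecessor: adjacent outputs become correlated.
        out[j] = static_cast<uint8_t>((static_cast<unsigned>(in[j]) + prev) / 2);
        prev = in[j];
    }
    return out;
}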

I don't think this was the answer you were looking for - so you likely need to clarify the question more.

Upvotes: 0

JHBonarius

Reputation: 11281

I cannot comment yet, but I would like to start the discussion: from communication/information theory, it would seem that you require probabilistic shaping methods to achieve what you want. You should be able to feed the output of any distribution function through a shaping coder, which then re-distributes the input to a specific target Shannon entropy. Probabilistic constellation shaping has been successfully applied in fiber-optic communication: Wikipedia with some other links
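
To sketch the idea (this is my own toy illustration, not an actual shaping coder): weight the 256 byte values with a Maxwell-Boltzmann-style factor exp(-lambda * i), bisect lambda until the distribution's Shannon entropy hits a chosen target (6.5 bits below, inside the 6.0..7.0 range of the question), then sample from it with std::discrete_distribution. The target value and the bisection bounds are arbitrary choices for the sketch:

#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Shannon entropy (bits/symbol) of the normalized weights w_i ~ exp(-lambda * i),
// optionally returning the probabilities themselves.
double entropy_for_lambda(double lambda, std::vector<double>* out = nullptr)
{
    std::vector<double> p(256);
    double sum = 0.0;
    for (int i = 0; i != 256; ++i) {
        p[i] = std::exp(-lambda * i);
        sum += p[i];
    }
    double h = 0.0;
    for (double& pi : p) {
        pi /= sum;
        if (pi > 0.0) h -= pi * std::log2(pi);
    }
    if (out) *out = p;
    return h;
}

int main()
{
    const double target = 6.5;      // desired bits/symbol
    double lo = 0.0, hi = 1.0;      // entropy is 8 at lambda = 0 and decreases as lambda grows
    for (int it = 0; it != 60; ++it) {
        double mid = 0.5 * (lo + hi);
        (entropy_for_lambda(mid) > target ? lo : hi) = mid;  // bisect lambda towards the target
    }

    std::vector<double> p;
    std::cout << "shaped source entropy ~ " << entropy_for_lambda(0.5 * (lo + hi), &p)
              << " bits/symbol\n";

    std::mt19937 rng{std::random_device{}()};
    std::discrete_distribution<int> shaped(p.begin(), p.end());
    for (int i = 0; i != 10; ++i) {
        std::cout << shaped(rng) << ' ';  // bytes drawn from the shaped (non-uniform) distribution
    }
    std::cout << '\n';
    return 0;
}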

Upvotes: 1
