Jon Hutton
Jon Hutton

Reputation: 31

File size of 1 million rand numbers

The file by rand is 1 million random numbers. It is compressed down to 415 kb....how is this possible if it is impossible to compress random data.

Thank you.

Jon Hutton

Upvotes: 1

Views: 924

Answers (2)

schnaader
schnaader

Reputation: 49719

You're most likely talking about the famous "A Million Random Digits" test data that was published in 1955. So it's digits, not numbers, as Mark already guessed, that's why the binary version is only 415,241 bytes. Also see Mark Nelson's homepage that has a link to the binary file.

Note that the end result (the binary file) is not compressible without knowing it - although there are some small redundancies in the file that come from the way it was created - see this forum entry for more details:

There are potentially other biases in the million random digits file that I discussed years ago in comp.compression. The data was originally generated by sampling a 5 bit counter driven by a noisy oscillator to produce a set of 20,000 punched cards with 50 digits each. But there was some correlation between consecutive digits, so what they did was add adjacent pairs of cards modulo 10 to produce a new set of cards which was published. That is why the sums of the columns are even. Each of the original cards is counted twice.

Upvotes: 3

Mark Adler
Mark Adler

Reputation: 112384

Sounds like they're stored as one decimal digit per byte. So using only ten of the 256 possible bytes values leaves you with the potential for a log(256)/log(10) compression ratio on random digits, which is about 2.4. You're getting 2.35 (assuming "kb" = 1024 bytes). Voila.

You can get 2.4 quite easily by coding every three digits into ten bits, since 1024 > 1000. Then you can code 1,000,000 decimal digits into 416,667 bytes, or 406.9 KiB.

With a little more difficulty, using something like GMP, you could code it as a giant million-digit integer in binary, which would take 415,242 bytes, or 405.5 KiB. That would be as good as it gets for random decimal digits.

Upvotes: 2

Related Questions