rcastano
rcastano

Reputation: 13

Is the following lossless data compression algorithm theoretically valid?

I am wondering if the following algorithm is a valid lossless data compression algorithm (although not practical with traditional computers, maybe quantum computers?).

At a high and simplified level, the compression steps are:

  1. Calculate the character frequency of the uncompressed text.
  2. Calculate the SHA3-512 (or another hash function) of the uncompressed text.
  3. Concatenate the SHA3-512 and the character frequency (this is now the compressed text that would be written to a file).

And at a high and simplified level, the decompression steps are:

  1. Using the character frequency in the compressed file, generate a permutation of the uncompressed text (keep track of which one).
  2. Calculate the SHA3-512 of the generated permutation in step 1.
  3. If the SHA3-512 calculated in step 2 matches the SHA3-512 in the compressed file, the decompression is complete. Else, go to step 1.

Would it be possible to have a SHA3-512 collision with a permutation of the uncompressed text (i.e. can two permutations of a given character frequency have the same SHA3-512?)? If so, when could this start happening (i.e. after how many uncompressed text characters?)?

One simplified example is as follows:

Upvotes: 1

Views: 129

Answers (1)

Jim Mischel
Jim Mischel

Reputation: 133985

Your compression method assumes that there is only one permutation of the given character frequency table that will generate the given hash code. That's provably false.

A 512-bit hash can represent on the order of 1.34E+154 unique values. The number of permutations in a 100-character file is 100!, or 9.33E+157.

Given a 100-character file, there are over 6,900 different permutations for each possible 512-bit hash code.

Using a larger hash code won't help. The number of hash codes doubles with each bit you add, but the number of possible permutations grows more with each character you add to the file.

Upvotes: 3

Related Questions