Is the following lossless data compression algorithm theoretically valid?
I am wondering if the following algorithm is a valid lossless data compression algorithm (although not practical with traditional computers, maybe quantum computers?).
At a high and simplified level, the compression steps are:
- Calculate the character frequency of the uncompressed text.
- Calculate the SHA3-512 (or another hash function) of the uncompressed text.
- Concatenate the SHA3-512 and the character frequency (this is now the compressed text that would be written to a file).
And at a high and simplified level, the decompression steps are:
- Using the character frequency in the compressed file, generate a permutation of the uncompressed text (keep track of which one).
- Calculate the SHA3-512 of the generated permutation in step 1.
- If the SHA3-512 calculated in step 2 matches the SHA3-512 in the compressed file, the decompression is complete. Else, go to step 1.
Would it be possible to have a SHA3-512 collision with a permutation of the uncompressed text (i.e. can two permutations of a given character frequency have the same SHA3-512?)? If so, when could this start happening (i.e. after how many uncompressed text characters?)?
One simplified example is as follows:
- The uncompressed text is: "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas et enim vitae ligula ultricies molestie at ac libero. Duis dui erat, mollis nec metus nec, porttitor scelerisque enim. Aenean feugiat tellus sit amet facilisis imperdiet. Fusce et nisl porta, aliquam quam eget, mollis sapien. Sed purus est, efficitur elementum quam quis, congue rutrum libero. Etiam metus leo, hendrerit ac dui in, hendrerit blandit sem. Etiam pellentesque enim dapibus luctus volutpat. Praesent aliquet ipsum vitae mauris pulvinar, et pharetra leo semper. Nulla a mauris tellus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Integer sollicitudin dui sapien, in tempus arcu facilisis in. Vivamus dui dolor, faucibus eu accumsan eu, porttitor id risus. In auctor congue pellentesque. Cras malesuada enim eget est vehicula pretium. Phasellus scelerisque imperdiet lorem, eu euismod lectus convallis consequat. Nam vitae euismod est, vitae lacinia arcu. Praesent fermentum sit amet erat feugiat cursus. Pellentesque magna felis, euismod vel vehicula eu, tincidunt ac ex. Vestibulum viverra justo nec orci semper, nec consequat justo faucibus. Curabitur dignissim feugiat nulla, in cursus nunc facilisis id. Suspendisse potenti. Etiam commodo turpis non fringilla semper. Vivamus aliquam ex non lorem tincidunt, et sagittis tellus placerat. Proin malesuada tortor eu viverra faucibus. Curabitur euismod orci lorem, ut fermentum velit consectetur vel. Nullam sodales cursus maximus. Curabitur nec turpis erat. Vestibulum eget lorem nunc. Morbi laoreet massa vel nulla feugiat gravida. Nulla a rutrum neque. Phasellus maximus tempus neque, eu sagittis ex volutpat ac. Duis malesuada sem vitae lacus suscipit, eu dictum elit euismod. Sed id sagittis leo. Sed convallis nisi nisl, vel pretium elit cursus vel. Duis quis accumsan odio. Ut arcu ex, iaculis a lectus sit amet, lacinia pellentesque enim. Donec maximus ante odio, a porta odio luctus at. Nullam dapibus aliquet sollicitudin. Sed ultrices iaculis blandit. Suspendisse dapibus, odio non venenatis faucibus, justo urna euismod neque, non finibus ante ante in massa. Sed sit amet nunc vel lacus dictum euismod. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Interdum et malesuada fames ac ante ipsum primis in faucibus. Fusce varius lacus velit, venenatis consequat justo rutrum nec. Nunc cursus odio arcu, nec egestas purus feugiat nec. Aliquam efficitur ornare ullamcorper. Mauris consectetur, quam vitae ultricies ullamcorper, nulla nulla tempus risus, aliquet euismod urna erat gravida neque. Suspendisse et viverra enim, ut facilisis enim. Quisque quis elit diam. Morbi quis nulla bibendum, molestie risus egestas, pharetra nisl. Aliquam sed massa dictum, scelerisque odio vel, finibus tellus. Nam tristique commodo sem, a dictum risus euismod sed. Morbi vel urna nec sem consectetur auctor quis ac augue. Donec ac pellentesque tortor. In hendrerit ultricies consequat. Pellentesque non metus vitae elit euismod efficitur in in leo. Nulla ac pulvinar nunc. Donec porttitor nunc ante, et congue augue laoreet ac. Vivamus bibendum id est eleifend efficitur. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc arcu neque, molestie ac lorem id, feugiat efficitur erat. Vestibulum vel condimentum lectus, eu euismod turpis.".
- The character frequency is: "⎵:501 e:345 i:277 u:266 s:240 t:226 a:219 l:161 n:154 r:147 m:132 c:128 o:117 d:79 .:64 p:54 ,:47 v:40 q:39 f:35 g:31 b:31 h:11 P:9 N:9 S:8 x:7 D:6 V:6 M:5 I:4 C:4 j:4 L:3 A:3 E:3 F:2 U:1 Q:1".
- The SHA3-512 is: "45ebde65cf667d1bfdcf779baab84301c1d4abe60448be821adda9cf7b99b36a61c53233db4a0eda93a04c75201be13bbb638b5e78f5047560fffc97f1c95adb".
- The compressed file contents are: "45ebde65cf667d1bfdcf779baab84301c1d4abe60448be821adda9cf7b99b36a61c53233db4a0eda93a04c75201be13bbb638b5e78f5047560fffc97f1c95adb⎵:501 e:345 i:277 u:266 s:240 t:226 a:219 l:161 n:154 r:147 m:132 c:128 o:117 d:79 .:64 p:54 ,:47 v:40 q:39 f:35 g:31 b:31 h:11 P:9 N:9 S:8 x:7 D:6 V:6 M:5 I:4 C:4 j:4 L:3 A:3 E:3 F:2 U:1 Q:1".
Answers (1)
Your compression method assumes that there is only one permutation of the given character frequency table that will generate the given hash code. That's provably false.
A 512-bit hash can represent on the order of 1.34E+154 unique values. The number of permutations in a 100-character file is 100!, or 9.33E+157.
Given a 100-character file, there are over 6,900 different permutations for each possible 512-bit hash code.
Using a larger hash code won't help. The number of hash codes doubles with each bit you add, but the number of possible permutations grows more with each character you add to the file.