Python Zlib: Why is the base64-encoded content not consistent when compressed and decompressed?

Question

My question is as follows.
I have a number of strings that were compressed by the Java Inflater tool and encoded in base64. To recover the plaintext, I wrote the following Python code, which reverses those steps.
When I try a short ciphertext, the values before and after compression and uncompression are the same, but when I try a long ciphertext, the values before and after compression and uncompression differ.

My code is as follows.

import base64
import zlib

cobj = zlib.compressobj(wbits=-zlib.MAX_WBITS)
dobj = zlib.decompressobj(wbits=-zlib.MAX_WBITS)

# read ciphertext 
with open("base64str", "r") as f:
    b = f.read()

length = len(b)
# decode base64
d = base64.b64decode(b)
# decompress
data_string = dobj.decompress(d)
data_string += dobj.flush()
# write into file
with open('data_string', 'wb') as f:
    f.write(data_string)

# read plaintext
with open('data_string', 'rb') as f:
    data_string = f.read()
# compress
data1 = cobj.compress(data_string)
data1 += cobj.flush()
# encode base64
data2 = base64.b64encode(data1)
# write into file
with open('data2', 'wb') as f:
    f.write(data2)

The example ciphertexts are as follows.
The short ones are:
S0vMKU4FAA==
c9d18bZw13UJdDbW9QoMcYsMBwA=

The long one is:

VdC7asNAEAXQH7Jg3g91chDGYBQh4iKEFCncB6ImBP97xohVSLFb7FzOzM7bz/r9eetPyzhOh6/1Y70NvSFwgiImGQqjuG+lY995OKEDAmiGV4zxftiM4+U6NqJLRHdiRGRTqItkN9SolHpVc3nEUpoxX5f58qeEVxMxdc2sGHtCU4xqFMFQVjcVJI6GPC/DdNoRCXDEqHowu0q1bUYXUfMHK7gBCZGkNuTpddgX0lEQMlfUIgwN6jQDy4+I6g/oAZb/FzIv5+llyw59x6hcHTOVuYKShs1Rqn9CqIoxqxbo9/df

You can copy it into a file named base64str (see line 7 of my code) and try it.
Finally, you can compare the two files base64str and data2.

By the way, the file data_string is the plaintext.

I tried many ciphertexts like the ones above and got different results.
I want to know what causes this, and how I can ensure consistent results no matter how long the ciphertext is.

Mark Adler · Accepted Answer

Your code is the opposite of what you say in your question. Your question says "compressed and decompressed" in the title, and "after compression and uncompression" in the body. However, your code does decompression and then compression. x -> compress -> decompress -> y is a completely different thing from x -> decompress -> compress -> y.
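As a rough illustration of the two pipelines, here is a minimal sketch using the first short sample from the question. The helper names `raw_inflate` and `raw_deflate` are made up for this example; they simply wrap the same `zlib.decompressobj` / `zlib.compressobj` calls with `wbits=-zlib.MAX_WBITS` that the question's code uses.

```python
import base64
import zlib


def raw_inflate(blob: bytes) -> bytes:
    """Decompress a raw DEFLATE stream (no zlib or gzip header)."""
    dobj = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
    return dobj.decompress(blob) + dobj.flush()


def raw_deflate(data: bytes) -> bytes:
    """Compress to a raw DEFLATE stream with Python's default settings."""
    cobj = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    return cobj.compress(data) + cobj.flush()


# First short sample from the question.
original_compressed = base64.b64decode("S0vMKU4FAA==")
plaintext = raw_inflate(original_compressed)

# x -> compress -> decompress -> y: lossless, so y == x is guaranteed.
assert raw_inflate(raw_deflate(plaintext)) == plaintext

# x -> decompress -> compress -> y: y may or may not equal x, because the
# original compressor (here, a Java one) can make different choices.
recompressed = raw_deflate(plaintext)
print("same compressed bytes:", recompressed == original_compressed)

# What does hold: both streams decode to the same plaintext.
assert raw_inflate(recompressed) == raw_inflate(original_compressed)
```

The printed comparison is deliberately not asserted, since it depends on how the original stream was produced.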

A lossless compressor guarantees that compression followed by decompression will give you exactly the same thing back. However, there is no assurance whatsoever that decompression followed by compression will give you the same compressed data you started with. If the compression settings, the way the data was fed to the compressor, or the compressor itself (a different version, or simply different code) differ, then the compressed data may very well be different. Only for very short texts will most compressors give the same thing, because there are few choices to be made to minimize the size of the output.
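A small sketch of that effect, staying entirely within Python's `zlib`: the same input is compressed with different levels and with a different feeding pattern. The helper `raw_deflate`, its `chunk_size` parameter, and the sample data are illustrative assumptions, not part of the question. Every variant decompresses to the same plaintext, yet the compressed bytes are not all alike.

```python
import zlib


def raw_deflate(data: bytes, level: int = -1, chunk_size=None) -> bytes:
    """Raw DEFLATE; optionally feed the data in chunks with a sync flush after each."""
    cobj = zlib.compressobj(level=level, wbits=-zlib.MAX_WBITS)
    if chunk_size is None:
        return cobj.compress(data) + cobj.flush()
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        out += cobj.compress(data[start:start + chunk_size])
        # Z_SYNC_FLUSH forces out the pending block plus an empty stored block.
        out += cobj.flush(zlib.Z_SYNC_FLUSH)
    out += cobj.flush()  # finish the stream
    return bytes(out)


def raw_inflate(blob: bytes) -> bytes:
    dobj = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
    return dobj.decompress(blob) + dobj.flush()


data = b"abc" * 1000  # arbitrary, highly repetitive sample input

variants = {
    "default level": raw_deflate(data),
    "level 0 (stored)": raw_deflate(data, level=0),
    "level 9": raw_deflate(data, level=9),
    "sync-flushed chunks": raw_deflate(data, chunk_size=500),
}

for name, blob in variants.items():
    # Different compressed sizes, identical plaintext after decompression.
    print(f"{name}: {len(blob)} bytes, round-trips: {raw_inflate(blob) == data}")
```

If the goal is to check that two base64 strings carry the same content, comparing the decompressed plaintexts is the reliable test; comparing the compressed bytes is not.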
