ni8mr

Reputation: 1785

What does zlib.compress(string) return in Python 2.7.8?

I am quite new to Python. Today I came across the zlib module and ran the following code:

import zlib
s = 'hello world!hello world!hello world!hello world!'
t = zlib.compress(s)
print t
print zlib.decompress(t)

and it prints the following:

xœËHÍÉÉW(Ï/ÊIQÌ ‚
hello world!hello world!hello world!hello world!

Obviously, zlib.compress() also returns some other weird characters besides these, which I can't copy and paste into my question.

My questions are:

1) What does compressing a string actually mean?

2) Do these weird characters have any meaning, or follow any convention?

3) What are the real-life applications of the compress() function?

N.B. I don't know any other programming language, so I have very little programming experience.

Upvotes: 1

Views: 1241

Answers (1)

Martijn Pieters

Reputation: 1121484

You are printing the compressed data. Compressed data is not text; it is binary data that represents the same information in less space.

When you write that compressed data to your terminal, the terminal still tries to interpret it as text: if it expects Latin-1 or UTF-8 encoded text, it will try to decode the bytes and display whatever it managed to decode. You end up with gibberish because the data is not actually text.

My Mac terminal is set to UTF-8, and I get something different from what you see:

>>> import zlib
>>> s = 'hello world!hello world!hello world!hello world!'
>>> t = zlib.compress(s)
>>> print t
?[?H???W(?/?IQ? ?

The ? question marks indicate that the terminal wasn't even able to decode everything as UTF-8; quite expected because the data isn't valid UTF-8.

Different encodings will result in different output; again, because the data isn't actually representing text in any text codec:

>>> print t.decode('cp850').encode('utf8')
¢[§H═╔╔W(¤/╩IQ╠ é
>>> print t.decode('cp1251').encode('utf8')
Ѕ[хHНЙЙW(П/КIQМ ‚
>>> print t.decode('mac-roman').encode('utf8')
Ω[ıHÕ……W(œ/ IQà Ç

The .encode('utf8') calls are really redundant; Python has detected that I use a UTF-8 terminal and will automatically encode Unicode strings for me.
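The same idea can be sketched for Python 3, which is an assumption here since the question uses Python 2.7. The sketch takes the 23 compressed bytes this answer prints as a hex dump and decodes them with several codecs; the variable names are only illustrative.

```python
# Compressed bytes copied from the hex dump printed in this answer.
data = bytes.fromhex('789ccb48cdc9c95728cf2fca4951cc20820d00bd5b11f5')

# cp850 and mac_roman map every byte to some character; cp1251 leaves only
# byte 0x98 undefined, and that byte does not occur in this data.
as_cp850 = data.decode('cp850')
as_cp1251 = data.decode('cp1251')
as_mac = data.decode('mac_roman')

# The data is not valid UTF-8, so ask for U+FFFD replacement characters
# instead of getting a UnicodeDecodeError.
as_utf8 = data.decode('utf-8', errors='replace')

# ascii() escapes non-ASCII and control characters, so what is printed does
# not depend on the terminal's own encoding.
for label, text in [('cp850', as_cp850), ('cp1251', as_cp1251),
                    ('mac_roman', as_mac), ('utf-8', as_utf8)]:
    print(label, ascii(text))
```

Each codec produces a different string from identical bytes, which is why two terminals can show different gibberish for the same compressed data.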

Python can also give you different representations of the same data. Echoing the string in your Python interpreter (rather than using print), or printing the output of repr(), gives you output formatted as a Python string literal that will recreate the same value:

>>> t
'x\x9c\xcbH\xcd\xc9\xc9W(\xcf/\xcaIQ\xcc \x82\r\x00\xbd[\x11\xf5'
>>> print repr(t)
'x\x9c\xcbH\xcd\xc9\xc9W(\xcf/\xcaIQ\xcc \x82\r\x00\xbd[\x11\xf5'

Any byte that can be interpreted as a printable ASCII character is shown as such, everything else is shown as \xhh hex escapes (with newlines, carriage returns and tabs using \n, \r and \t, respectively).
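A small sketch of that escaping rule, assuming Python 3, where the value is a bytes object and its repr() carries a b prefix. The sample bytes are made up for illustration and are not real compressed data.

```python
import ast

# Two printable ASCII bytes ('x' and 'A'), a tab, a newline, a carriage
# return, and two bytes with no printable ASCII form (0x9c and 0x00).
sample = bytes([0x78, 0x9c, 0x09, 0x0a, 0x0d, 0x41, 0x00])

literal = repr(sample)
print(literal)  # b'x\x9c\t\n\rA\x00'

# The repr is a valid Python literal, so evaluating it recreates the value.
assert ast.literal_eval(literal) == sample
```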

You could also encode all the byte values to hex:

>>> print t.encode('hex')
789ccb48cdc9c95728cf2fca4951cc20820d00bd5b11f5
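The 'hex' codec used above exists only in Python 2. As a hedged sketch for Python 3 (an assumption, since the question targets 2.7), the same dump can be produced with bytes.hex() or binascii.hexlify(), and reversed with bytes.fromhex():

```python
import binascii

# The compressed value as shown by repr() in this answer.
data = b'x\x9c\xcbH\xcd\xc9\xc9W(\xcf/\xcaIQ\xcc \x82\r\x00\xbd[\x11\xf5'

# Two hex characters per byte.
hex_text = data.hex()
print(hex_text)  # 789ccb48cdc9c95728cf2fca4951cc20820d00bd5b11f5

# binascii.hexlify() gives the same digits, but as bytes rather than str.
assert binascii.hexlify(data).decode('ascii') == hex_text

# The hex form is lossless: it converts back to the original bytes.
assert bytes.fromhex(hex_text) == data
```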

Having data take up less space is very useful. Sending the data across a network will take less time (less data to send), or you can save on disk space. When compressing images, you could even discard some information while compressing; JPEG images use such a lossy compression scheme, for example. Depending on the quality level you set you'll lose more or less original information, but you can cram a lot of image information into a file that way.
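To make the space saving concrete, here is a minimal sketch assuming Python 3, where zlib.compress() takes and returns bytes. It compares sizes before and after compression and checks that decompression restores the input exactly; the longer input is added only for illustration.

```python
import zlib

# The text from the question, as bytes.
original = b'hello world!' * 4
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# zlib is lossless: decompressing gives back exactly what went in.
assert restored == original

print(len(original), 'bytes before,', len(compressed), 'bytes after')

# Repetitive data compresses better the longer it gets.
big = b'hello world!' * 1000
big_compressed = zlib.compress(big)
print(len(big), 'bytes before,', len(big_compressed), 'bytes after')
```

Unlike the lossy JPEG case, nothing is discarded here; the saving comes entirely from describing the repetition more compactly.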

Upvotes: 5
