yurisich

Reputation: 7119

Putting gzipped data into a script as a string

I snagged a Lorem Ipsum generator last week, and I admit, it's pretty cool.

My question: can someone show me a tutorial on how the author of the above script was able to post the contents of a gzipped file into their code as a string? I keep getting examples of gzipping a regular file, and I'm feeling kind of lost here.

For what it's worth, I have another module that is quite similar (it generates random names, companies, etc), and right now it reads from a couple different text files. I like this approach better; it requires one less sub-directory in my project to place data into, and it also presents a new way of doing things for me.

I'm quite new to streams, IO types, and the like. Feel free to dump the links on my lap. Snippets are always appreciated too.

Upvotes: 0

Views: 4967

Answers (4)

Old question, but I had to do this recently for AWS logs. In Python 3, use BytesIO instead of StringIO:

import base64
import gzip
from io import BytesIO

DEFAULT_SAMPLE_COMPRESSED = "Some base 64 encoded and gzip compressed string"

sample_text_file = gzip.GzipFile(
    mode='rb',
    fileobj=BytesIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED))
)

binary_text = sample_text_file.read()  # the decompressed data, as bytes
text = binary_text.decode()  # decode the bytes (UTF-8 by default) into a str

Upvotes: 1

cdjc

Reputation: 1108

How about this? It gzips and Base64-encodes a string, prints the encoded text, then decodes and unzips it again (Python 2):

from StringIO import StringIO
import base64
import gzip

contents = 'The quick brown fox jumps over the lazy dog'

zip_text_file = StringIO()

zipper = gzip.GzipFile(mode='wb', fileobj=zip_text_file)

zipper.write(contents)
zipper.close()

enc_text =  base64.b64encode(zip_text_file.getvalue())
print enc_text

sample_text_file = gzip.GzipFile(mode='rb',
    fileobj=StringIO(base64.b64decode(enc_text)))
DEFAULT_SAMPLE = sample_text_file.read()
sample_text_file.close()
print DEFAULT_SAMPLE
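On Python 3 the same round trip has to work on bytes rather than str, so io.BytesIO stands in for StringIO and the text is encoded before it is compressed. A minimal sketch of such a port, assuming UTF-8 text and reusing the variable names from the example above:

```python
import base64
import gzip
from io import BytesIO

contents = 'The quick brown fox jumps over the lazy dog'

# Compress the UTF-8 bytes into an in-memory buffer.
zip_text_file = BytesIO()
with gzip.GzipFile(mode='wb', fileobj=zip_text_file) as zipper:
    zipper.write(contents.encode('utf-8'))

# Base64 turns the compressed bytes into ASCII that can be pasted into source.
enc_text = base64.b64encode(zip_text_file.getvalue()).decode('ascii')
print(enc_text)

# Reverse the steps: Base64-decode, gunzip, then decode back to str.
with gzip.GzipFile(mode='rb',
                   fileobj=BytesIO(base64.b64decode(enc_text))) as sample_text_file:
    DEFAULT_SAMPLE = sample_text_file.read().decode('utf-8')
print(DEFAULT_SAMPLE)
```

The printed Base64 text is what you would paste into your script as a string constant.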

Upvotes: 1

Toote

Reputation: 3413

Assuming you are in a *nix environment, you just need gzip and a Base64 encoder to generate the string. Let's assume your content is in file.txt; for this example I created a file with that name containing random bytes.

So you need to compress it first:

$ gzip file.txt

That will generate a file.txt.gz file that you now need to embed into your code. To do that, you need to encode it. A common way to do so is to use Base64 encoding, which can be done with the base64 program:

$ base64 file.txt.gz
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=

Now you have all you need to use the contents of that file in your Python script:

from cStringIO import StringIO
from base64 import b64decode
from gzip import GzipFile

# this is the variable with your file's contents    
gzipped_data = """
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
"""

# we now decode the file's content from the string and unzip it
orig_file_desc = GzipFile(mode='r', 
                          fileobj=StringIO(b64decode(gzipped_data)))

# get the original's file content to a variable
orig_file_cont = orig_file_desc.read()

# and close the file descriptor
orig_file_desc.close()

Obviously, your program will depend on the base64, gzip and cStringIO Python modules, all of which are part of the Python 2 standard library.
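If you would rather not shell out, the string can also be produced from Python itself. The sketch below is one possible way to do it on Python 3, not something taken from the script in question: make_embedded_literal and load_embedded are hypothetical helper names, and the sample bytes are a stand-in for the contents of file.txt.

```python
import base64
import gzip


def make_embedded_literal(raw, name='gzipped_data'):
    """Return Python source that assigns gzip+Base64 of `raw` to `name`."""
    # encodebytes wraps its output at 76 characters per line.
    encoded = base64.encodebytes(gzip.compress(raw)).decode('ascii')
    return '{} = """\n{}"""\n'.format(name, encoded)


def load_embedded(encoded):
    """Inverse step: Base64-decode, then gunzip."""
    # With default settings b64decode discards characters outside the
    # Base64 alphabet, so the newlines in the literal are harmless.
    return gzip.decompress(base64.b64decode(encoded))


raw = b'Alice\nBob\nCarol\n' * 20   # stand-in for the contents of file.txt
source = make_embedded_literal(raw)
print(source)                        # paste this output into your script
```

Because the Base64 alphabet contains no quotes or backslashes, the encoded text is safe to drop inside a triple-quoted string as-is.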

Upvotes: 5

larsks

Reputation: 311526

I'm not sure exactly what you're asking, but here's a stab...

The author of lipsum.py has included the compressed data inline in their code as chunks of Base64 encoded text. Base64 is an encoding mechanism for representing binary data using printable ASCII characters. It can be used for including binary data in your Python code. It is more commonly used to include binary data in email attachments...the next time someone sends you a picture or PDF document, take a look at the raw message and you'll see very much the same thing.

Python's base64 module provides routines for converting between base64 and binary representations of data...and once you have the binary representation of the data, it doesn't really matter how you got it, whether by reading it from a file or by decoding a string embedded in your code.
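As a small illustration of that conversion (a Python 3 sketch, with made-up sample bytes), any byte string survives the round trip through Base64, at the cost of growing by roughly a third:

```python
import base64

binary = bytes(range(256))            # every possible byte value, most of them unprintable
encoded = base64.b64encode(binary)    # only A-Z, a-z, 0-9, '+', '/' and '=' padding
decoded = base64.b64decode(encoded)   # back to the original bytes

print(encoded[:32])
print(len(binary), len(encoded))      # encoded form is about 4/3 the size
```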

Python's gzip module can be used to decompress data. It expects a file-like object...and Python provides the StringIO module to wrap strings in the right set of methods to make them act like files. You can see that in lipsum.py in the following code:

sample_text_file = gzip.GzipFile(mode='rb',
    fileobj=StringIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED)))

This is creating a StringIO object containing the binary representation of the base64 encoded value stored in DEFAULT_SAMPLE_COMPRESSED.
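On Python 3.2 and later the file-like wrapper is optional, because gzip.decompress takes the compressed bytes directly. A minimal sketch, assuming the data is UTF-8 text; the constant here is a stand-in built on the spot, not the actual value from lipsum.py:

```python
import base64
import gzip

# Stand-in for the constant in lipsum.py, built here so the example runs on its own.
DEFAULT_SAMPLE_COMPRESSED = base64.b64encode(gzip.compress(b'Lorem ipsum dolor sit amet'))

# Base64-decode to bytes, then decompress those bytes without a file object.
sample_text = gzip.decompress(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED)).decode('utf-8')
print(sample_text)
```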

All the modules mentioned here are described in the documentation for the Python standard library.

In general I wouldn't recommend embedding data inline like this unless it is small and relatively static. Otherwise, ship it as a data file inside your Python package, which makes it easier to edit and to track changes.

Have I answered the right question?

Upvotes: 3
