Gzip to base64 encoding adds characters to JSON string

Question

I have a nested Python dictionary that I serialize to a JSON string, then gzip-compress and base64-encode. However, when I decode it back to the JSON string, backslashes (\) appear that weren't in the original JSON string before conversion. This happens at each level of the nested dictionary. These are the functions:

import json
import io
import gzip
import base64
import zlib

import numpy as np  # required by numpy_encoder below

class numpy_encoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super(numpy_encoder, self).default(obj)


def dict_json_dump(dictionary):
    dumped = json.dumps(dictionary, cls = numpy_encoder, separators=(",", ":"))
    return dumped

def gzip_json_encoder(json_string):
    stream = io.BytesIO()
    with gzip.open(filename=stream, mode='wt') as zipfile:
        json.dump(json_string, zipfile)
    return stream

def base64_encoder(gzip_string):
    return base64.b64encode(gzip_string.getvalue())

We can use the functions as follows:

json_dict = pe.dict_json_dump(test_dictionary)
gzip_json = pe.gzip_json_encoder(json_dict)
base64_gzip = pe.base64_encoder(gzip_json)

When I decode base64_gzip with the following line:

json_str = zlib.decompress(base64.b64decode(base64_gzip), 16 + zlib.MAX_WBITS)

I get the JSON string back in a format like this (truncated):

b'"{\"trainingResults\":{\"confusionMatrix\":{\"tn\":2,\"fn\":1,\"tp\":1,\"fp\":1},\"auc\":{\"score\":0.5,\"tpr\":[0.0,0.5,0.5,1.0],\"fpr\":[0.0,0.333,0.667,1.0]},\"f1\"

This isn't the full string, but the contents of the string are accurate. What I'm not sure about is why the backslashes show up when I convert it back. Does anyone have any suggestions? I tried UTF-8 encoding my JSON as well, with no luck. Any help is appreciated!

Barmar · Accepted Answer

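You're doing JSON encoding twice: once in dict_json_dump() and again in gzip_json_encoder(). Since json_string is already encoded, calling json.dump() on it serializes the string itself as a JSON string literal, which wraps it in quotes and escapes every inner quote as \". You don't need to call json.dump() in gzip_json_encoder(); write the string to the stream directly.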
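The double encoding can be reproduced without gzip or base64. This is a minimal sketch; the sample dictionary is made up for illustration and is not the asker's data:

```python
import json

# Hypothetical sample data, standing in for the real nested dictionary.
data = {"tn": 2, "fn": 1}

once = json.dumps(data, separators=(",", ":"))  # dict -> JSON text
twice = json.dumps(once)                        # JSON text -> JSON string literal

print(once)   # {"tn":2,"fn":1}
print(twice)  # "{\"tn\":2,\"fn\":1}"

# Each json.dumps() call needs a matching json.loads() call to undo it.
assert json.loads(once) == data
assert json.loads(json.loads(twice)) == data
```

The second call sees only a plain str, so it quotes it and escapes the inner quotes, which is where the backslashes come from.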

def gzip_json_encoder(json_string):
    stream = io.BytesIO()
    with gzip.open(filename=stream, mode='wt') as zipfile:
        zipfile.write(json_string)
    return stream
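To check the fix end to end, a round trip along these lines should return the original dictionary. This is only a sketch: the encode() and decode() helper names and the sample dictionary are assumptions for illustration, and the numpy encoder is left out to keep it short.

```python
import base64
import gzip
import io
import json


def encode(dictionary):
    """Serialize once, gzip the text, then base64-encode the gzip bytes."""
    json_string = json.dumps(dictionary, separators=(",", ":"))
    stream = io.BytesIO()
    with gzip.open(filename=stream, mode="wt", encoding="utf-8") as zipfile:
        zipfile.write(json_string)  # write the text as-is, no second json.dump()
    return base64.b64encode(stream.getvalue())


def decode(base64_gzip):
    """Reverse encode(): base64-decode, gunzip, then parse the JSON once."""
    raw = gzip.decompress(base64.b64decode(base64_gzip))
    return json.loads(raw.decode("utf-8"))


# Hypothetical sample data shaped like the output in the question.
sample = {"trainingResults": {"confusionMatrix": {"tn": 2, "fn": 1, "tp": 1, "fp": 1}}}

encoded = encode(sample)
restored = decode(encoded)
assert restored == sample
```

Because the JSON is encoded only once, a single json.loads() on the decompressed text is enough, and the decompressed bytes contain no added backslashes.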
