mlenthusiast

Reputation: 1204

Gzip to base64 encoding adds characters to JSON string

I have a nested Python dictionary that I serialize to a JSON string, then gzip-compress and base64-encode. However, when I decode it back to the JSON string, the result contains \\ sequences that aren't in the original JSON string. This happens at each level of the nested dictionary. These are the functions:

import json
import io
import gzip
import base64
import zlib

import numpy as np  # needed by numpy_encoder below

class numpy_encoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super(numpy_encoder, self).default(obj)


def dict_json_dump(dictionary):
    dumped = json.dumps(dictionary, cls = numpy_encoder, separators=(",", ":"))
    return dumped

def gzip_json_encoder(json_string):
    stream = io.BytesIO()
    with gzip.open(filename=stream, mode='wt') as zipfile:
        json.dump(json_string, zipfile)
    return stream

def base64_encoder(gzip_string):
    return base64.b64encode(gzip_string.getvalue())

We can use the functions as follows:

json_dict = pe.dict_json_dump(test_dictionary)
gzip_json = pe.gzip_json_encoder(json_dict)
base64_gzip = pe.base64_encoder(gzip_json)

When I check the base64_gzip with the following function:

json_str = zlib.decompress(base64.b64decode(base64_gzip), 16 + zlib.MAX_WBITS)

I get the JSON string back in a format like this (truncated):

b'"{\\"trainingResults\\":{\\"confusionMatrix\\":{\\"tn\\":2,\\"fn\\":1,\\"tp\\":1,\\"fp\\":1},\\"auc\\":{\\"score\\":0.5,\\"tpr\\":[0.0,0.5,0.5,1.0],\\"fpr\\":[0.0,0.333,0.667,1.0]},\\"f1\\"

This isn't the full string, but the contents of the string are accurate. What I'm not sure about is why the backslashes show up when I convert it back. Does anyone have any suggestions? I tried UTF-8 encoding my JSON as well, with no luck. Any help is appreciated!

Upvotes: 2

Views: 1264

Answers (1)

Barmar

Reputation: 782407

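You're doing JSON encoding twice: once in dict_json_dump() and again in gzip_json_encoder(). Since json_string is already a JSON string, json.dump() serializes it as a JSON string literal, which wraps it in double quotes and escapes every inner double quote with a backslash. (The doubled \\ in your output is just how the bytes repr displays a single backslash.) You don't need to call json.dump() in gzip_json_encoder(); write the string directly: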

def gzip_json_encoder(json_string):
    stream = io.BytesIO()
    with gzip.open(filename=stream, mode='wt') as zipfile:
        zipfile.write(json_string)
    return stream
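
To see the difference, here is a minimal round-trip sketch. The sample dictionary and the variable names (sample, double_encoded, round_tripped) are made up for illustration and are not from your code, and I pass encoding="utf-8" explicitly as an assumption so the result doesn't depend on the locale default:

```python
import base64
import gzip
import io
import json
import zlib


def gzip_json_encoder(json_string):
    # Write the already-serialized JSON text as-is; no second json.dump().
    stream = io.BytesIO()
    with gzip.open(filename=stream, mode="wt", encoding="utf-8") as zipfile:
        zipfile.write(json_string)
    return stream


# Hypothetical sample data, standing in for the real nested dictionary.
sample = {"trainingResults": {"confusionMatrix": {"tn": 2, "fn": 1}}}
json_string = json.dumps(sample, separators=(",", ":"))

# What the original code effectively did: serialize the JSON text a second
# time. The result is a quoted JSON string literal with escaped inner quotes.
double_encoded = json.dumps(json_string)

# The corrected pipeline: gzip the JSON text directly, then base64-encode.
encoded = base64.b64encode(gzip_json_encoder(json_string).getvalue())

# Decode the same way as in the question (16 + MAX_WBITS selects gzip framing).
decoded_bytes = zlib.decompress(base64.b64decode(encoded), 16 + zlib.MAX_WBITS)
round_tripped = json.loads(decoded_bytes.decode("utf-8"))
```

With the fix, a single json.loads() on the decompressed bytes gives back the dictionary. With the double-encoded version you would need to call json.loads() twice to undo both layers.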

Upvotes: 3
