MiniQuark

Reputation: 48446

How to truncate data in a dict so that the resulting JSON isn't longer than n bytes?

I have a Python 2.7 dict such as {u"eat": u"糖果", u"drink": u"café"}, and I need to transfer it using JSON. The JSON string must be regular ASCII and less than 256 chars long.

So far, I have coded this:

import json

def payload_to_json(payload, max_size = 256):
    while True:
        json_string = json.dumps(payload, separators = (',', ':'))
        if len(json_string) <= max_size:
            return json_string
        # find the key with the longest value, then chop one char off it
        max_length, found_key = 0, None
        for key, value in payload.iteritems():
            length = len(value)
            if length > max_length:
                max_length = length
                found_key = key
        if max_length == 0:
            return "" # just in case max_size is really low
        payload[found_key] = payload[found_key][:-1] # remove one char

It works as expected:

>>> payload = {u"eat": u"糖果", u"drink": u"café"}
>>> print payload_to_json(payload)
{"drink":"caf\u00e9","eat":"\u7cd6\u679c"}
>>> print payload_to_json(payload, max_size=41)
{"drink":"caf","eat":"\u7cd6\u679c"}
>>> print payload_to_json(payload, max_size=35)
{"drink":"ca","eat":"\u7cd6\u679c"}
>>> print payload_to_json(payload, max_size=34)
{"drink":"c","eat":"\u7cd6\u679c"}
>>> print payload_to_json(payload, max_size=30)
{"drink":"c","eat":"\u7cd6"}
>>> print payload_to_json(payload, max_size=21)
{"drink":"","eat":""}
>>> print payload_to_json(payload, max_size=20)

It seems to me that there should be a way to optimize this! I'm really stripping one character at a time, and it feels so wrong.

My question is very close to this one, except that I use Python 2.7, and the JSON encoder produces pretty long JSON strings whenever the source strings contain non-ASCII unicode chars.
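
For instance, with json.dumps' default ensure_ascii=True, every non-ASCII character in the source comes out as a six-char \uXXXX escape:

>>> import json
>>> json.dumps(u"糖果")
'"\\u7cd6\\u679c"'
>>> len(json.dumps(u"糖果"))
14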

Plus I'm pretty sure this will break with UTF-16 surrogate pairs...

Upvotes: 2

Views: 3935

Answers (3)

caesarsol

Reputation: 2113

Why don't you use the strategy from the post you linked: measure the first generated JSON, then strip the right number of chars from the values, in your preferred order?

Otherwise you could estimate the number of chars the JSON framing uses by counting: for each mapped variable, the chars "":"", plus the overall {}, minus one comma. (Unless you have a more complicated nested structure, obviously.)
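
For instance, a rough sketch of that accounting (assuming a flat dict of unicode keys and values and the same separators=(',', ':') as in the question; the helper name is made up):

import json

def estimate_json_length(payload):
    # each entry costs its serialized key and value plus the 6 framing
    # chars of '"":"",'; the object adds '{}' and drops one trailing comma
    content = sum(len(json.dumps(k)[1:-1]) + len(json.dumps(v)[1:-1])
                  for k, v in payload.iteritems())
    return content + 6 * len(payload) + 1

For the payload in the question this agrees with the full dump:

>>> estimate_json_length({u"eat": u"糖果", u"drink": u"café"})
42
>>> len(json.dumps({u"eat": u"糖果", u"drink": u"café"}, separators=(',', ':')))
42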

The unicode functionality shouldn't be a problem as long as you use the u'' notation (not sure, but it shouldn't be difficult to check).

Upvotes: 0

abarnert

Reputation: 365697

If you're trying to make this faster (which you shouldn't be, unless you know this is a hotspot in your program with a real performance cost), you can first guess the number of characters to strip, and then deal with leftovers.

First, if you need to strip 52 characters, and there are 10 keys, you need to strip 6 chars each from 2 keys, and 5 each from the other 8, right? Except, of course, that you may be trying to strip 6 chars from something that's only 4 chars long, which means you'll still end up 2 chars over the limit. But you can keep track of those leftovers and deal with them after you're done. It's unlikely that there will be enough leftovers to make another pass through the "fast" version worth doing, so you might as well just use the "slow" version.

import json

def payload_to_json(payload, max_size = 256):
    json_string = json.dumps(payload, separators = (',', ':'))
    chars_to_strip = len(json_string) - max_size
    if chars_to_strip <= 0:
        return json_string
    # spread the excess evenly; the first `extras` keys give up one extra char
    key_count = len(payload)
    chars_per_key, extras = divmod(chars_to_strip, key_count)
    leftover = 0
    for i, key in enumerate(payload):
        to_strip = chars_per_key + (i < extras)
        if not to_strip:
            continue # slicing with [:-0] would wipe the value out
        orig_len = len(payload[key])
        if orig_len < to_strip:
            payload[key] = ''
            leftover += to_strip - orig_len # this key couldn't absorb its share
        else:
            payload[key] = payload[key][:-to_strip]
    if leftover:
        # fall back to the original one-char-at-a-time version for the rest
        return slow_payload_to_json(payload, max_size)
    else:
        return json.dumps(payload, separators = (',', ':'))

I'm not sure this actually will speed things up in your use cases. For very small objects and max sizes, I wouldn't be surprised if it actually slows things down. But for huge objects way over the max size, it would probably help a lot.

Upvotes: 1

Has QUIT--Anony-Mousse

Reputation: 77454

How about computing the serialized size of each entry, then choosing as many entries as will fit within the desired length?
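
A rough sketch of that idea (the helper name is made up; it assumes a flat dict of unicode keys and values, the same separators as in the question, and simply keeps the cheapest entries first):

import json

def fit_entries(payload, max_size=256):
    # serialized cost of each entry: '"key":"value"' plus a separating comma
    sizes = {k: len(json.dumps({k: v}, separators=(',', ':'))) - 2 + 1
             for k, v in payload.iteritems()}
    kept, used = [], 1  # '{' and '}' cost 2, minus the comma the first entry skips
    for key in sorted(payload, key=sizes.get):
        if used + sizes[key] <= max_size:
            kept.append(key)
            used += sizes[key]
    return json.dumps({k: payload[k] for k in kept}, separators=(',', ':'))

>>> print fit_entries({u"eat": u"糖果", u"drink": u"café"}, max_size=30)
{"drink":"caf\u00e9"}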

Either way, this sounds like a really bad idea overall.

Upvotes: 0
