ysurilov

Reputation: 300

Compression efficiency

I need to save data in local storage or send it over the network. The data is stored as a plain object (key-value pairs). To save bandwidth and storage space, I'm planning to convert the long, verbose key names to corresponding digits (using a map) and restore them when the data is received. For example:

var statesMap = {
    SEARCH: 0,
    SORT: 1,
    FILTER: 2,
    DISPLAY_FAV: 3,
    PANEL_POS: 4,
    MENU_LIST_POS: 5,
    MAIN_LIST_POS: 6,
    INFO_LIST_POS: 7,
    CHANNEL: 8
};

var config = {
    APP: ['SEARCH', 'SORT', 'DISPLAY_FAV', 'PANEL_POS', 'MENU_LIST_POS', 'MAIN_LIST_POS', 'CHANNEL']
};

app.state = {};

// restore state from {"0":"ui","1":"NAME","3":false,"4":0,"5":3,"6":0}
// 'appName' is the current application name, defined elsewhere in the app
config[appName.toUpperCase()].forEach(function ( state ) {
    var value = storage[statesMap[state]]; // 'storage' stores my data
    // convert "compressed" properties to full equivalents
    app.state[state] = value != null ? value : '';
});

// result {"SEARCH":"ui","SORT":"NAME","DISPLAY_FAV":false,"PANEL_POS":0,"MENU_LIST_POS":3,"MAIN_LIST_POS":0,"CHANNEL":""}

// store data
var tmp = {};

// filter out fields with empty-string values, don't store them
// (String() keeps 0 and false, since they stringify to non-empty "0" and "false")
Object.keys(app.state)
    .filter(function ( key ) { return !!String(app.state[key]); })
    .forEach(function ( state ) {
        tmp[statesMap[state]] = app.state[state];
    });

storage = tmp;

Are there sufficient benefits to this approach? Are there better optimizations? Does this optimization interfere with the gzip compression algorithm?

Thanks a lot.

Upvotes: 1

Views: 355

Answers (2)

Ismael Miguel

Reputation: 4241

Here is my take on your question.

If you use your idea, either skip gzip entirely or apply it only when the data is longer than x bytes.
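For example, a minimal Node.js sketch of that size-threshold idea (the maybeGzip name and the 1024-byte cutoff are assumptions for illustration, not recommended values):

var zlib = require('zlib');

var THRESHOLD = 1024; // the "x bytes" cutoff; tune it for your data

function maybeGzip(json) {
    var buf = Buffer.from(json, 'utf8');
    // only pay gzip's header and CPU cost for payloads big enough to benefit
    return buf.length > THRESHOLD ? zlib.gzipSync(buf) : buf;
}

The receiver then needs a way to tell the two cases apart, for example by checking for the gzip magic bytes 0x1f 0x8b at the start of the payload.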

The best option is not to use it at all.


This is why I say not to use your method:

  • Readability is more important than saving 30 bytes
  • You add complexity to both the server and the client code
  • It makes debugging harder, since nothing is clear anymore
  • Extending it is extremely difficult and error-prone.

Using your method has the following advantages:

  • If you are accessing the data from a browser without gzip support, this will still save you bandwidth
  • Since you know the format, you can achieve better compression without gzip

But, you have to consider this:

  • Gzip (and all compression formats) works better with bigger and more repetitive data. That is where it shines. If you remove the redundancy yourself and send a small output, you're wasting processing power only to increase the output size. This output without redundancy is impossible to compress.
  • Using your method will reduce readability significantly and remove all flexibility. It will force you to place all the required data at the beginning and the optional data at the end. It won't fit every case you might have and will force you to grow your logic code to deal with that.

One example: suppose that, in the future, you want to add debugging information or fields that you won't always return. Here's a very basic example:

  • If the operation succeeds: {"error":false,[... data goes here ...]}
  • If some "soft-error" happens (eg: failed validation): {"error":true,"type":"..."}
  • If a nasty error happened: {"error":500,"desc":"Service unavailable"}

With your code, the second field could be "desc", "type", or anything else.

You could say: "We can add the desc and type fields and always send them!" But then you are sending more data, defeating the reason to use your method.


Backing up my claims

What kind of answer would mine be without some data to back up bold statements like "impossible to compress"?

Let's consider your example data:

{"SEARCH":"ui","SORT":"NAME","DISPLAY_FAV":false,"PANEL_POS":0,"MENU_LIST_POS":3,"MAIN_LIST_POS":0,"CHANNEL":""}

Which you (non-optimally) reduced to this:

{"0":"ui","1":"NAME","3":false,"4":0,"5":3,"6":0}

All the compression values will be using this base PHP code:

$output = '<content>'; // one of the test payloads listed below goes here

echo 'gzip:', strlen(gzencode($output)), ' original:', strlen($output);

This prints the gzipped size and the original size, side by side. I will keep the code as simple as possible, to avoid alienating non-PHP developers.


Here are the results I got from different executions:

  • {"SEARCH":"ui","SORT":"NAME","DISPLAY_FAV":false,"PANEL_POS":0,"MENU_LIST_POS":3,"MAIN_LIST_POS":0,"CHANNEL":""}
    Result: gzip:112 original:112 (+0 bytes)
  • {"0":"ui","1":"NAME","3":false,"4":0,"5":3,"6":0}
    Result: gzip:65 original:49 (+16 bytes over the original)
  • ["ui","NAME",false,0,3,0] (proper JSON output, based on your object)
    Result: gzip:45 original:25 (+20 bytes over the original)

You can try the code on http://sandbox.onlinephpfunctions.com/code/d8d0799147e3256ede2b730cb1ded7cf66c1eb67

The outputs were taken from comments on the question, not made up by me, except the last one, which is based on the 2nd output. The 2nd output shows an array being represented as an object (sequential numeric keys).
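For completeness, here is a rough JavaScript sketch of how that positional array form could be produced and restored, reusing the shared key order from the question's config (the packed/restored names are mine):

var order = config.APP; // shared, fixed key order; both sides must agree on it

// store: write the values positionally instead of as key-value pairs
var packed = order.map(function ( key ) {
    return app.state[key];
});
// JSON.stringify(packed) -> ["ui","NAME",false,0,3,0,""]

// restore: zip the shared order back onto the received values
var restored = {};
order.forEach(function ( key, i ) {
    restored[key] = packed[i];
});

Trimming the trailing empty value would give exactly the 25-byte output measured above.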


Okay, I said many words, showed some code and whatnot. My conclusion?

For your case, either use gzip or use your method. I would stick to plain and simple gzip, and call it a day.

You won't be needlessly increasing your output size.

Even then, if you really want your method, disable gzip and save CPU cycles on your server.

Upvotes: 0

BeeOnRope

Reputation: 64905

The optimization you are referring to might be called "token replacement" or something like that and is a reasonable approach to domain-specific compression.

This type of transformation doesn't prevent matching+entropy based algorithms like gzip from working, and so you aren't likely to get a larger final size after applying this transformation. That said, the replacement you are doing is exactly the type of thing that gzip is good at doing, so doing it yourself before invoking gzip may be a bit redundant.

To know for sure, you can simply test! What are the typical results of your token replacement + gzip, versus gzip alone?
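For instance, a quick Node.js harness for that comparison could look like this (gzipSize is a hypothetical helper; the two payloads are the question's own examples):

var zlib = require('zlib');

function gzipSize(str) {
    return zlib.gzipSync(Buffer.from(str, 'utf8')).length;
}

var original = '{"SEARCH":"ui","SORT":"NAME","DISPLAY_FAV":false,' +
               '"PANEL_POS":0,"MENU_LIST_POS":3,"MAIN_LIST_POS":0,"CHANNEL":""}';
var replaced = '{"0":"ui","1":"NAME","3":false,"4":0,"5":3,"6":0}';

console.log('gzip alone:         ', gzipSize(original));
console.log('replacement + gzip: ', gzipSize(replaced));
console.log('replacement, raw:   ', replaced.length);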

Even without testing, here are some advantages and disadvantages of the token replacement-before-gzip based approach:

Advantages

  1. Sometimes you can do the token replacement more efficiently than gzip, if you do it early in the output generation, and can use this more compact form through most of your processing chain. You may get speedups in other parts of your code because things like string comparisons are now replaced by very fast integer comparisons (however, see disadvantage #1).
  2. gzip has certain limitations, stemming partly from its age and its huge range of target hardware, that you can work around with your token replacement. For example, it only finds matches within a 32 KiB window, while your token replacement applies across the entire payload.
  3. Token replacement effectively takes advantage of your knowledge of the format to encode more efficiently than gzip could. For example, although gzip will perform largely the same substitutions as you do for frequently occurring keys, your replacement will also do well for infrequently occurring keys (but this doesn't matter if there are a lot of them). In effect, you are able to make use of a pre-distributed, out-of-band dictionary, which helps compression (a sketch of this idea follows the list).
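As an illustration of that last point: the gzip container itself has no preset-dictionary field, but the closely related zlib/deflate format does, and Node's zlib module exposes it. A sketch of the idea (the dictionary contents here are an assumed example; in practice you would build it from your most common keys and values):

var zlib = require('zlib');

// out-of-band dictionary: both sides ship this with the application,
// so it never travels with the payload
var dict = Buffer.from(
    '"SEARCH":"SORT":"DISPLAY_FAV":"PANEL_POS":' +
    '"MENU_LIST_POS":"MAIN_LIST_POS":"CHANNEL":'
);

var payload    = Buffer.from(JSON.stringify(app.state));
var compressed = zlib.deflateSync(payload, { dictionary: dict });
var restored   = zlib.inflateSync(compressed, { dictionary: dict });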

Disadvantages

  1. Gzip is usually efficiently implemented using native libraries provided by your OS or runtime environment. Especially since you are calling gzip anyway, just using gzip may be faster than doing your token replacement at the language level. How much faster (or slower) depends a lot on your implementation.
  2. Compression-wise gzip is largely doing the same replacement operation you are (plus many more), so replacing tokens yourself is somewhat redundant. That is, gzip looks for "matches": strings that already occurred earlier in the text, and replaces them with tokens, just like you do. Plus it does a lot of other good stuff, like entropy coding the tokens, so you want it as a final step no matter what.
  3. Adding your own pre-compression step may not harm compression (even if it doesn't help much), but it adds complexity, a source of bugs, possible performance issues, and so on. You also need to somehow transmit the key <-> integer mapping to remote clients (one way of handling that is sketched below).
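One way to handle that last concern is to version the map and send the version with every payload, so a client with a stale map can detect the mismatch; a rough sketch (the v and d field names are assumptions):

// bump this every time statesMap changes
var MAP_VERSION = 1;

// sender: wrap the compressed data with the map version
var message = { v: MAP_VERSION, d: tmp };

// receiver: refuse to decode with a stale map
if (message.v !== MAP_VERSION) {
    // request the current statesMap (or a full update) before decoding
}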

Basically, I would recommend against it, unless your testing shows that it provides a significant performance boost. Usually it won't, since gzip is already removing most of the redundancy, but it depends on the specifics of your situation.

Upvotes: 3
