ThunderFrame

Reputation: 9461

Why doesn't Ascii85 encoding allow for dynamic compression?

According to Wikipedia:

[Ascii85 uses] the ASCII characters 33 (!) through 117 (u) inclusive (to represent the base-85 digits 0 through 84), together with the letter z (as a special case to represent a 32-bit 0 value).

[btoa] Version 4.2 added a "y" exception for a group of all ASCII space characters
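For reference, both shortcuts can be seen in Python's standard library. This is a small sketch that assumes `base64.a85encode` is a fair stand-in for the behaviour Wikipedia describes; the `y` shortcut is only applied when `foldspaces=True` is passed.

```python
import base64

# Four zero bytes collapse to the single character 'z'.
zero_group = base64.a85encode(b"\x00\x00\x00\x00")

# Four spaces collapse to 'y' only when the btoa-style option is enabled.
space_group_folded = base64.a85encode(b"    ", foldspaces=True)
space_group_plain = base64.a85encode(b"    ")

print(zero_group, space_group_folded, len(space_group_plain))
```

Without `foldspaces`, the four spaces take the usual five characters.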

While zero bytes might be quite common, using z to compress a group of zeros seems like an arbitrary optimization that won't always be of use.

Likewise, the less frequently used y only helps if the raw bytes contain four adjacent spaces. In UTF-16 (little-endian) text a space is actually encoded as 20 00, so 0x20202020 isn't all that common in such texts.

Binary data does often have adjacent 00's, but it also often contains adjacent FF's.

Text data does often contain adjacent spaces, but it also often contains adjacent tab characters, or adjacent new-line characters.

It would seem that a frequency analysis, and usage of 9 or 10 characters (Ascii chars 118-126/127, or v through ~/DEL) to represent the 9/10 most frequent 32-bit values, might lead to better compression.

The mapping of compression-character to 32-bit value could perhaps sit at the start of the encoded string enclosed between <[ and ]>. For 32-bit values that are 4 repeated bytes, the 32-bit value can be abbreviated to the repeated hex value(s).

For example:

The binary data (192 bytes):

00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00

00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00

00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00

00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00

00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00

00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00

00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00

00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00

Note the presence of spaces (20), hyphens (2D), tabs (09) and UTF-16 (little-endian) Carriage Return-Line Feeds (0D 00 0A 00).

Could be encoded as (79 bytes):

<[00;FF;20;2D;09;0D000A00]><~vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|~>
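To make the proposal concrete, here is a minimal sketch of such an encoder. Everything in it is hypothetical: the function names, the choice to hand out symbols consecutively from v (so the dictionary characters replace the standard z and y shortcuts), the limit of 8 entries (leaving ~ alone so it cannot be confused with the ~> terminator), and the restriction to input whose length is a multiple of 4.

```python
from collections import Counter

def a85_group(group: bytes) -> str:
    """Encode one 4-byte group as five base-85 digits, with no z/y shortcuts."""
    n = int.from_bytes(group, "big")
    digits = []
    for _ in range(5):
        n, r = divmod(n, 85)
        digits.append(chr(33 + r))
    return "".join(reversed(digits))

def encode_with_dictionary(data: bytes, max_entries: int = 8) -> str:
    """Hypothetical encoder for the dictionary scheme sketched in the question."""
    if len(data) % 4:
        raise ValueError("this sketch only handles lengths that are a multiple of 4")
    groups = [data[i:i + 4] for i in range(0, len(data), 4)]

    # Frequency analysis: most frequent 32-bit groups first, ties in first-seen order.
    counts = Counter(groups)
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])[:max_entries]
    common = [g for g, n in ranked if n > 1]

    # Hand out symbols starting at ASCII 118 ('v'), above the base-85 digit range.
    symbol = {g: chr(118 + i) for i, g in enumerate(common)}

    def header_entry(g: bytes) -> str:
        # Four identical bytes are abbreviated to a single hex byte.
        return g[:1].hex().upper() if len(set(g)) == 1 else g.hex().upper()

    header = "<[" + ";".join(header_entry(g) for g in common) + "]>"
    body = "".join(symbol[g] if g in symbol else a85_group(g) for g in groups)
    return header + "<~" + body + "~>"

sample = bytes.fromhex("00000000FFFFFFFF202020202D2D2D2D090909090D000A00") * 8
encoded_sample = encode_with_dictionary(sample)
print(len(sample), len(encoded_sample))
print(encoded_sample)
```

For the 192-byte sample this produces the same header and a 79-character result, but with the symbols v, w, x, y, z, { rather than the hand-picked characters in the example above.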

Is there merit in an encoding approach that uses such compression? Why aren't the various Ascii85 specs more aggressive with compression?

Upvotes: 0

Views: 347

Answers (2)

supercat

Reputation: 81247

There are some applications for which it is useful to be able to find the Nth octet of an encoded string without having to scan the whole thing. Compression would interfere with that. There are, however, other applications for which certain forms of compression could be useful. If one can use more than 85 distinct characters, a base-85 coding will allow for easy compression using characters outside the primary set. Even if one is limited to a set of precisely 85 characters, the number of sequences of five base-85 characters is greater than the combined number of sequences of one, two, three, and four base-256 bytes, so there would be room to use some special combinations of characters to indicate e.g. runs of certain character values. The biggest problem is that doing so would forfeit the ability to perform random seeks within the encoded data stream.
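Both points can be illustrated with a short sketch. The helper `octet_at` is a hypothetical name, and it assumes a plain fixed-width encoding with no z or y shortcuts and no <~ ~> framing, which is exactly the property that compression would break. It also only handles complete five-character groups.

```python
import base64

def octet_at(encoded: str, n: int) -> int:
    """Return byte n of the decoded data by decoding only its 5-character group."""
    start = 5 * (n // 4)  # every 4 bytes occupy exactly 5 characters
    value = 0
    for ch in encoded[start:start + 5]:
        value = value * 85 + (ord(ch) - 33)
    return value.to_bytes(4, "big")[n % 4]

# This input contains no zero groups, so a85encode emits fixed-width groups.
encoded = base64.a85encode(b"ABCDEFGH").decode("ascii")
sixth_byte = octet_at(encoded, 5)

# Counting argument: five base-85 characters versus all 1- to 4-byte sequences.
five_char_codes = 85 ** 5
short_byte_sequences = sum(256 ** k for k in range(1, 5))
spare_codes = five_char_codes - short_byte_sequences
print(chr(sixth_byte), five_char_codes, short_byte_sequences, spare_codes)
# prints: F 4437053125 4311810304 125242821
```

The roughly 125 million spare five-character codes are the room that could be given to special meanings such as runs.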

Upvotes: 3

Mark Adler

Reputation: 112502

Because you would normally run a compression program before encoding with ASCII85, and that can do a much better job than the suggested ad hoc encodings.
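A minimal sketch of that pipeline, assuming zlib as the compressor (the answer does not name one) and Python's `base64.a85encode` as the ASCII85 step. The sample data is the repeated 24-byte pattern from the question.

```python
import base64
import zlib

sample = bytes.fromhex("00000000FFFFFFFF202020202D2D2D2D090909090D000A00") * 8

# Compress first, then make the result text-safe.
packed = base64.a85encode(zlib.compress(sample, 9))

# Plain ASCII85 for comparison (zero groups still use the 'z' shortcut).
plain = base64.a85encode(sample)

# Decoding reverses the two steps.
restored = zlib.decompress(base64.a85decode(packed))

print(len(sample), len(plain), len(packed))
```

The trade-off from the other answer still applies: once the data is compressed, a reader cannot seek to the Nth octet without decompressing.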

Upvotes: 3
