Reputation: 9461
According to Wikipedia:
[Ascii85 uses] the ASCII characters 33 (!) through 117 (u) inclusive (to represent the base-85 digits 0 through 84), together with the letter z (as a special case to represent a 32-bit 0 value).
[btoa] Version 4.2 added a "y" exception for a group of all ASCII space characters
While zero data might be quite common, that use of z to compress zeros seems like an arbitrary optimization that won't always be of use. Likewise, the less frequently used y only helps if the raw bytes contain adjacent spaces. In UTF-16LE, a space is actually encoded as 20 00, so 0x20202020 isn't all that common in UTF-16 text.
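To make the two shortcuts concrete, here is a small sketch using Python's standard `base64.a85encode`. It assumes Python's implementation is representative: there the z shortcut is always applied, and the y shortcut is only applied when `foldspaces=True` is passed. The variable names are illustrative only.

```python
import base64

# An all-zero 4-byte group collapses to the single character "z".
zero_group = base64.a85encode(b"\x00\x00\x00\x00")

# Four ASCII spaces collapse to "y" only when foldspaces=True (the btoa 4.2 rule).
space_group = base64.a85encode(b"    ", foldspaces=True)
space_group_plain = base64.a85encode(b"    ")  # ordinary 5-character group

# In UTF-16LE four spaces become 20 00 20 00 20 00 20 00, so no 4-byte
# group equals 0x20202020 and the "y" shortcut never fires.
utf16_spaces = "    ".encode("utf-16-le")
utf16_encoded = base64.a85encode(utf16_spaces, foldspaces=True)

print(zero_group, space_group, len(space_group_plain), len(utf16_encoded))
```

The last value shows the point made above: eight bytes of UTF-16LE spaces still cost two full five-character groups.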
Binary data often has adjacent 00 bytes, but it also often contains adjacent FF bytes.
Text data does often contain adjacent spaces, but it also often contains adjacent tab characters, or adjacent new-line characters.
It would seem that a frequency analysis, using 9 or 10 spare characters (ASCII 118-126/127, i.e. v through ~/DEL) to represent the 9 or 10 most frequent 32-bit values, might lead to better compression.
The mapping from compression character to 32-bit value could perhaps sit at the start of the encoded string, enclosed between <[ and ]>. For 32-bit values that consist of one byte repeated four times, the value can be abbreviated to that byte's two hex digits.
For example:
The binary data (192 bytes):
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
Note the presence of spaces (20), hyphens (2D), tabs (09) and UTF-16LE carriage return/line feed pairs (0D 00 0A 00).
This data could be encoded as (79 bytes):
<[00;FF;20;2D;09;0D000A00]><~vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|~>
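To make the proposal concrete, the following is a minimal Python sketch of the scheme as described, not an existing Ascii85 variant. The function names, the restriction to input lengths that are a multiple of 4, and the symbol alphabet are all assumptions. The alphabet `vxyz{|` is copied from the example string, and because it reuses y and z the result is not valid standard Ascii85.

```python
from collections import Counter

# Hypothetical symbol alphabet, copied from the example string.
# It reuses "y" and "z", so the output is NOT standard Ascii85.
SYMBOLS = "vxyz{|"

def abbreviate(group: bytes) -> str:
    """Hex form of a 4-byte group; one repeated byte shrinks to 2 hex digits."""
    if len(set(group)) == 1:
        return f"{group[0]:02X}"
    return group.hex().upper()

def plain_group(group: bytes) -> str:
    """Ordinary 5-digit base-85 form of a 4-byte group (no z/y shortcuts)."""
    value = int.from_bytes(group, "big")
    digits = []
    for _ in range(5):
        value, digit = divmod(value, 85)
        digits.append(chr(33 + digit))
    return "".join(reversed(digits))

def encode_with_dictionary(data: bytes) -> str:
    if len(data) % 4:
        raise ValueError("sketch only handles lengths that are a multiple of 4")
    groups = [data[i:i + 4] for i in range(0, len(data), 4)]
    # Frequency analysis: most frequent groups first, ties in first-seen order.
    frequent = [g for g, _ in Counter(groups).most_common()[:len(SYMBOLS)]]
    table = dict(zip(frequent, SYMBOLS))
    header = "<[" + ";".join(abbreviate(g) for g in frequent) + "]>"
    body = "".join(table.get(g) or plain_group(g) for g in groups)
    return header + "<~" + body + "~>"

row = bytes.fromhex("00000000 FFFFFFFF 20202020 2D2D2D2D 09090909 0D000A00")
sample = row * 8                      # the 192-byte example data
encoded = encode_with_dictionary(sample)
print(len(sample), len(encoded))      # 192 79
```

A real design would also need a decoder, handling for a trailing partial group, and a rule for when a dictionary entry is worth its header cost.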
Is there merit in an encoding approach that uses such compression? Why aren't the various Ascii85 specs more aggressive with compression?
Upvotes: 0
Views: 347
Reputation: 81247
There are some applications for which it is useful to be able to find the Nth octet of an encoded string without having to scan the whole thing. Compression would interfere with that.

There are, however, other applications for which certain forms of compression could be useful. If one can use more than 85 distinct characters, a base-85 coding will allow for easy compression using characters outside the primary set. Even if one is limited to a set of precisely 85 characters, the number of sequences of five base-85 characters is greater than the combined number of sequences of one, two, three, and four base-256 bytes, so there would be room to use some special combinations of characters to indicate e.g. runs of certain character values.

The biggest problem is that doing so would forfeit the ability to perform random seeks within the encoded data stream.
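Both points can be checked with a short Python sketch. The helper `octet_at` is a hypothetical name, and it assumes plain, unframed Ascii85 with no z or y shortcuts and no embedded whitespace, so that every 4 input bytes occupy exactly 5 output characters. The second half just evaluates the counting argument.

```python
import base64

def octet_at(encoded: bytes, n: int) -> int:
    """Return byte n of the original data without decoding everything.

    Only valid when the stream contains no z/y shortcuts or whitespace.
    """
    start = 5 * (n // 4)                       # fixed 5-character stride
    group = base64.a85decode(encoded[start:start + 5])
    return group[n % 4]

data = bytes(range(1, 41))         # 40 bytes, no all-zero 4-byte group
encoded = base64.a85encode(data)   # 50 characters
byte_17 = octet_at(encoded, 17)    # equals data[17]

# Counting argument for an alphabet of exactly 85 characters.
five_char_sequences = 85 ** 5                                # 4_437_053_125
short_byte_sequences = sum(256 ** k for k in range(1, 5))    # 4_311_810_304
spare_codes = five_char_sequences - short_byte_sequences     # 125_242_821
```

Once a single group is shortened to z, or to any other compression symbol, the fixed stride is gone and `octet_at` would have to scan from the start.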
Upvotes: 3
Reputation: 112502
Because you would normally run the data through a compression program before encoding it with Ascii85, and a real compressor can do a much better job than the suggested ad hoc encodings.
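As a rough illustration of compressing first, here is a sketch using Python's standard `zlib` and `base64` modules on the 192-byte sample from the question. The choice of zlib is an assumption; any general-purpose compressor would serve.

```python
import base64
import zlib

row = bytes.fromhex("00000000 FFFFFFFF 20202020 2D2D2D2D 09090909 0D000A00")
sample = row * 8   # the 192-byte example data from the question

# Ascii85 on its own (Python applies the "z" shortcut automatically).
plain = base64.a85encode(sample, adobe=True)

# Compress first, then make the result text-safe with Ascii85.
packed = base64.a85encode(zlib.compress(sample, 9), adobe=True)

# Round trip: decode the Ascii85 layer, then decompress.
restored = zlib.decompress(base64.a85decode(packed, adobe=True))

print(len(sample), len(plain), len(packed))
```

Printing the three lengths shows how much of the saving comes from the compressor rather than from the encoding layer.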
Upvotes: 3