Compressing data in a txt file

Question

I am looking for advice which method I could use to compress the following data: 65|018 03C 066 066 066 07E 066 066 066 000 000 000 000 000 000 000 66|07C 066 066 066 07C 066 066 066 07C 000 000 000 000 000 000 000 67|03C 066 066 060 060 060 066 066 03C 000 000 000 000 000 000 000 ... , formatted as | ... <16TH ROW OF PIXELS>.

Where one row of pizels can be 3 hex digits. and make it less demanding to read. I already tried Run-Length encoding (because of lots of redundant data), but it didn't work well. It also needs to be simple enough to decompress.

Dan Mašek · Accepted Answer

The simplest approach would be to just use an existing solution, say zlib, and decompress to memory before parsing. But that makes a boring answer, so let's look at what we can achieve ourselves.

Representing the data in textual form is rather inefficient. In the current format, we need 68 bytes (assuming line terminator is CRLF) to represent one record.

The record itself consists of an ASCII code (1 byte), and 16 rows of pixels. Each row is represented by 3-digit hexadecimal number, i.e. 12 bits. That's 25 bytes total, if we pack 2 rows into 3 bytes, or 33 bytes total if we just use a 16bit integer for each row. That's 37% or 49% of the original space requirement, and encoding/decoding it is more or less trivial.

The ASCII codes, are continuous, but don't start from 0, and don't run all the way to 255. You could sort the records by the code, store the index of the first and last, along with a bitmap for this range that indicates missing symbols. Then you wouldn't have to store the code with each record.

It seems that vast majority of the symbols contain empty columns on the left, and empty rows in the bottom (gaps on the top are less common).

You could store the number of such empty columns and rows, and then avoid saving those bits. Since there are 12 columns, 4 bits are enough. Skipping 12 or more columns means missing glyph (so we don't need the bitmap mentioned earlier). Another 4 bits to represent rows skipped on the bottom (since skipping all columns is like skipping all rows, we don't need to be able to skip 16 rows).

You could take some inspiration from Lempel-Ziv algorithms to compress glyphs.

Let us define the following opcodes:

0 - Verbatim Row. Next 12 bits are the pixels.
1 - Copy Row. Next 4 bits are offset (or perhaps index) of the row to copy from.
2 - [optional] Empty row. No parameter needed. Current row is set to all 0s.
3 - [optional] End glyph. No parameter needed. Current and all following rows are set to 0s.

Note: The empty row could instead be represented as copying with 0 offset.

Now each glyph can be represented as a sequence of those opcodes and their parameters.

The probability distributions of 0s and 1s in the glyph images are probably skewed. Similar case will be with the opcodes and offsets in the LZ-like scheme. Entropy coding them might be a good choice (using something like arithmetic coder that can encode close to entropy). You could also consider making the model context sensitive.

Finally you could mix and match the approaches mentioned (fall back to raw representation if LZ expands the data).

Full source code with a number of variants is on pastebin, due to size I'll include the interesting bits.

Comparison of the efficiency of the different variants (0 is unmodified) on the sample input is as follows:

Variant 0: 6547 bytes
Variant 1: 3194 bytes
Variant 2: 2434 bytes
Variant 3: 1493 bytes
Variant 4: 1483 bytes
Variant 5: 1296 bytes
Variant 6: 1152 bytes
Variant 7: 1011 bytes
Variant 8: 839 bytes
Variant 9: 789 bytes
Variant 10: 669 bytes

Simple structure to represent the font in memory:

struct glyph
{
    uint8_t code;
    std::array rows;
};

struct font
{
    std::string header;
    std::vector glyphs;

    void sort()
    {
        std::sort(glyphs.begin(), glyphs.end()
            , [](glyph const& l, glyph const& r) -> bool {
                return l.code < r.code;
            });

    }
};

And a simple output bitstream implementation:

struct simple_bitstream
{
    simple_bitstream(std::vector& buffer)
        : buf_(buffer)
        , temp_(0)
        , temp_size_(0)
    {
    }

    void write_bits(uint32_t v, uint8_t bits)
    {
        if (bits) {
            write_bits(v >> 1, bits - 1);
            write_bit(v & 1);
        }        
    }

    void write_bit(uint8_t v)
    {
        temp_ = (temp_ << 1) | (v & 1);
        ++temp_size_;
        if (temp_size_ == 8) {
            buf_.push_back(temp_);
            temp_size_ = 0;
            temp_ = 0;
        }
    }

    void flush()
    {
        for (; temp_size_;) {
            write_bit(0);
        }
    }

    std::vector& buf_;
    uint8_t temp_;
    uint8_t temp_size_;
};

Variant 6 -- the Lempel-Ziv inspired algorithm with 4 opcodes.

// Find nearest identical preceding row in this glyph
uint8_t find_nearest_copy(glyph const& g, uint32_t i)
{
    uint8_t offset(0);
    uint16_t row(g.rows[i]);
    for (uint8_t j(1); j < i; ++j) {
        if (row == g.rows[i - j]) {
            offset = j;
            break;
        }
    }
    return offset;
}

uint32_t find_end_row(glyph const& g)
{
    uint32_t end_row(16);
    for (uint32_t i(0); i < 16; ++i) {
        if (g.rows[15 - i] > 0) {
            break;
        }
        --end_row;
    }
    return end_row;
}

void encode_v6(font const& f, std::vector& buffer)
{
    uint32_t OP_VERBATIM(0), OP_COPY(1), OP_EMPTY(2), OP_END(3);

    encode_header(f, buffer);
    simple_bitstream b(buffer);
    for (glyph const& g : f.glyphs) {
        // Code using 1 byte
        b.write_bits(g.code, 8);

        uint32_t end_row(find_end_row(g));
        for (uint32_t i(0); i < end_row; ++i) {
            uint16_t row(g.rows[i]);
            if (row == 0) {
                // Empty row
                b.write_bits(OP_EMPTY, 2);
                continue;
            }

            // Find nearest identical preceding row in this glyph
            uint8_t offset(find_nearest_copy(g, i));
            if (offset) {
                // Copy with non-zero offset
                b.write_bits(OP_COPY, 2);
                b.write_bits(offset - 1, 4);
            } else {
                // Verbatim row
                b.write_bits(OP_VERBATIM, 2);
                b.write_bits(row, 12);
            }
        }
        if (end_row < 16) {
            // End the glyph (any remaining rows are empty)
            b.write_bits(OP_END, 2);
        }
    }
}

Variants 8-10 do entropy coding using FastAC (arithmetic codec), and some additional simple modelling.

Compressing data in a txt file

Answers (1)

Related Questions