Matheos

Reputation: 529

LZW decompression problem after Clear Code (Unix Compress .Z-files)

I am implementing my own decompression code for decompressing Unix COMPRESS'ed .Z files. I have the basic decompression working and tested on smaller example files, but when I test it on "real" files which may include the so-called "Clear Code" (code 256), I run into trouble.

My code is able to decompress the file well up until that point, but after clearing the table and resetting the code length back to its initial size of 9, the next code I read is faulty: it is larger than 255 (somewhere in the 400s). Since it is the first code read after the reset, the table obviously does not contain that entry yet.

I have compared the codes read just after the reset code (256) by my code and by SharpZipLib, and SharpZipLib gets a different code after the reset than I do. Our implementations are quite different, so I am having a hard time finding the issue. I suspect I start reading the bits in the wrong place somehow, though I have not managed to figure out what I am doing wrong...
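For context, compress packs codes least-significant-bit first, so a 9-bit code can straddle byte boundaries. Here is a small illustrative Python sketch of how such codes are unpacked (the function and the packed example bytes are mine, not taken from the code below):

```python
def read_codes(data, code_length):
    """Read LSB-first codes of a fixed bit length from a byte stream."""
    codes = []
    bit_buffer = 0
    bit_count = 0
    for byte in data:
        bit_buffer |= byte << bit_count  # new bits land above the existing ones
        bit_count += 8
        while bit_count >= code_length:
            codes.append(bit_buffer & ((1 << code_length) - 1))
            bit_buffer >>= code_length
            bit_count -= code_length
    return codes

# Two 9-bit codes, 65 and 66, packed LSB-first:
# byte0 = 65 & 0xFF = 0x41, byte1 = (65 >> 8) | ((66 & 0x7F) << 1) = 0x84, byte2 = 66 >> 7 = 0x00
print(read_codes(bytes([0x41, 0x84, 0x00]), 9))  # [65, 66]
```

A reader that loses track of the bit position by even one bit after a reset will decode exactly the kind of out-of-range codes described above.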

Maybe a second pair of eyes would help?

Code:

namespace LzwDecompressor;
public class Decompressor
{

#region LZW Constants
private readonly byte[] _magicBytes = new byte[] { 0x1f, 0x9d };
private const int BlockModeMask = 0x80; //0b1000 0000
private const int MaxCodeBitsMask = 0x1f; //0b0001 1111
private const int InitialCodeLength = 9;
private const int ClearCode = 256;
#endregion

private int _maxBits;
private int _maxCode = (1 << InitialCodeLength) - 1;
private bool _blockMode;
private int _codeLength = InitialCodeLength;
private readonly Stream _inputStream;
public Decompressor(Stream stream) => _inputStream = stream;

public byte[] Decompress()
{
    if (_inputStream.Length < 3)
        throw new LzwDecompressorException("Input too small to even fit the required header.");

    ParseHeader();
    var dictionary = InitDict();

    using var outStream = new MemoryStream();
    var code = ReadCode();

    if (code >= 256) throw new LzwDecompressorException("The first code cannot be larger than 255!");

    outStream.Write(new[] { (byte)code }); //First code is always uncompressed

    var old = code;
    var nextIndex = _blockMode ? 257 : 256; //Skip 256 in block mode as it is a "clear code"
    while ((code = ReadCode()) != -1)
    {
        if (_blockMode && code == ClearCode)
        {
            _codeLength = InitialCodeLength;
            _maxCode = (1 << _codeLength) - 1;
            dictionary = InitDict();
            nextIndex = 257; //Block mode first index
            //Logically I should be able to read the next code here and write it out directly, as the first code after a reset is uncompressed. But since the code read is wrong, I cannot do that
            //code = ReadCode();
            //outStream.Write(new [] { (byte)code });
            //old = code;
            continue;
        }

        var word = new List<byte>();

        if (dictionary.TryGetValue(code, out var entry))
        {
            word.AddRange(entry);
        }
        else if (dictionary.Count + 1 == nextIndex)
        {
            word.AddRange(dictionary[old].ToArray().Concat(new[] { dictionary[old][0] }));
        }

        if (word.Count > 0)
        {
            outStream.Write(word.ToArray());
            dictionary[nextIndex++] = new List<byte>(dictionary[old].ToArray().Append(word[0]));
            old = code;
        }

        if (_codeLength == _maxBits) continue; //prevent code length growing beyond max

        if (nextIndex == (1 << _codeLength))
        {
            _codeLength++;
            _maxCode = (1 << _codeLength) - 1;
            _ = dictionary.EnsureCapacity(1 << _codeLength);
        }
    }

    return outStream.ToArray();
}

#region Private methods
private void ParseHeader()
{
    if (_inputStream.ReadByte() != _magicBytes[0] || _inputStream.ReadByte() != _magicBytes[1])
    {
        throw new LzwDecompressorException("The given file does not contain the LZW magic bytes");
    }

    var descriptorByte = _inputStream.ReadByte();

    _maxBits = descriptorByte & MaxCodeBitsMask;
    _blockMode = (descriptorByte & BlockModeMask) > 0;
}

private static Dictionary<int, List<byte>> InitDict()
{
    var dict = new Dictionary<int, List<byte>>(1 << InitialCodeLength); //2⁹ max entries
    for (var i = 0; i < 256; i++) dict[i] = new List<byte> { (byte)i };

    return dict;
}

private int ReadCode()
{
    var code = 0x0;
    for (var i = 0; i < _codeLength; i++)
    {
        var bit = ReadBit();
        if (bit == -1) return -1;

        code |= bit << i;
    }
    return code;
}
#region Bit Reader
private int _currentBitMask = 0x100;
private int _currentByte;
private int ReadBit()
{
    if (_currentBitMask == 0x100)
    {
        _currentBitMask = 0x1;
        var newByte = _inputStream.ReadByte();
        if (newByte == -1) return -1;

        _currentByte = newByte;
    }
    var bit = (_currentByte & _currentBitMask) > 0 ? 1 : 0;
    _currentBitMask <<= 1;
    return bit;
}
#endregion
#endregion

}

public class LzwDecompressorException : Exception
{
    public LzwDecompressorException() { }
    public LzwDecompressorException(string message) : base($"LZW Decompressor: {message}") { }
    public LzwDecompressorException(string message, Exception inner) : base($"LZW Decompressor: {message}", inner) { }
}

I recognize that there may be other things missing still, and performance issues and improvements are bound to be found. I have not paid much attention to these yet, as I am first and foremost looking to get it working before I start swapping in more performant data structures.

Upvotes: 0

Views: 215

Answers (1)

Matheos

Reputation: 529

Update

I managed to get it working. I finally found an example suited for my use case (decompressing a Unix COMPRESS'ed file): this Python code on GitHub: unlzw

The only thing I was missing was the following byte position calculation after resetting:

# process clear code (256)
if (code == 256) and flags:
    # Flush unused input bits and bytes to next 8*bits bit boundary
    rem = (nxt - mark) % bits
    if rem:
        rem = bits - rem
        if rem > inlen - nxt:
            break
        nxt += rem

I had already refactored my original code to work with a byte array instead of the memory stream I had originally. This simplified my thinking somewhat, as I was now keeping track of the "current byte position" instead of the current byte. That makes it easy, on a clear code, to recalculate the new byte position from which to continue reading. According to a comment in the code, this Python snippet is based on an old machine instruction:

Flush unused input bits and bytes to next 8*bits bit boundary (this is a vestigial aspect of the compressed data format derived from an implementation that made use of a special VAX machine instruction!)

Having implemented my own version of this calculation, based on the parameters I already had at hand, I got it working! I must say I do not fully understand the logic behind it, but I am happy it works :)
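To spell out the snippet above: nxt is the current input byte position, mark is the position just after the last reset (or the header), bits is the current code width, and inlen is the input length. The flush rounds nxt up to the next multiple of bits bytes past mark, because compress emits codes in groups of 8*bits bits (i.e. bits bytes) and pads the last group before a clear code. A minimal Python sketch of the same rounding, with made-up numbers (the function name is mine):

```python
def flush_to_boundary(position, mark, code_length):
    """Round the byte position up to the next multiple of code_length
    bytes past mark, skipping the padding written before a clear code."""
    rem = (position - mark) % code_length
    if rem:
        position += code_length - rem
    return position

# With 9-bit codes, codes come in 9-byte groups (8 codes = 72 bits).
# If a clear code lands 30 bytes past the last mark, 6 padding bytes follow:
print(flush_to_boundary(30, 0, 9))  # 36
```

A position already on a group boundary is left unchanged, which is why decompressors that never hit a clear code mid-group work fine and only fail on "real" files.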

Upvotes: 1
