Matheos

Reputation: 529

LZW decompression problem after Clear Code (Unix Compress .Z-files)

I am implementing my own decompression code for decompressing Unix COMPRESS'ed .Z files. I have the basic decompression working and tested on smaller example files, but when I test it on "real" files which may include the so-called "Clear Code" (code 256), I run into trouble.

My code is able to decompress the file well up until that point, but after clearing the table and resetting the code length back to its initial size of 9, the next code I read is faulty: it is larger than 255 (somewhere in the 400s). Since it is the first code read after the reset, the table obviously does not contain that entry yet.

I have compared the codes read just after the reset code (256) by my code and by SharpZipLib, and SharpZipLib gets a different code after the reset than I do. Our implementations are quite different, so I am having a hard time finding the issue. I suspect I start reading the bits in the wrong place somehow, though I have not managed to figure out what I am doing wrong...
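For context, compress packs codes least-significant-bit first, so a 9-bit code can straddle byte boundaries. Here is a small illustrative Python sketch of how such codes are unpacked (the function and the packed example bytes are mine, not taken from the code below):

```python
def read_codes(data, code_length):
    """Read LSB-first codes of a fixed bit length from a byte stream."""
    codes = []
    bit_buffer = 0
    bit_count = 0
    for byte in data:
        bit_buffer |= byte << bit_count  # new bits land above the existing ones
        bit_count += 8
        while bit_count >= code_length:
            codes.append(bit_buffer & ((1 << code_length) - 1))
            bit_buffer >>= code_length
            bit_count -= code_length
    return codes

# Two 9-bit codes, 65 and 66, packed LSB-first:
# byte0 = 65 & 0xFF = 0x41, byte1 = (65 >> 8) | ((66 & 0x7F) << 1) = 0x84, byte2 = 66 >> 7 = 0x00
print(read_codes(bytes([0x41, 0x84, 0x00]), 9))  # [65, 66]
```

A reader that loses track of the bit position by even one bit after a reset will decode exactly the kind of out-of-range codes described above.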

Maybe a second pair of eyes would help?

Code:

namespace LzwDecompressor;
public class Decompressor
{

#region LZW Constants
private readonly byte[] _magicBytes = new byte[] { 0x1f, 0x9d };
private const int BlockModeMask = 0x80; //0b1000 0000
private const int MaxCodeBitsMask = 0x1f; //0b0001 1111
private const int InitialCodeLength = 9;
private const int ClearCode = 256;
#endregion

private int _maxBits;
private int _maxCode = (1 << InitialCodeLength) - 1;
private bool _blockMode;
private int _codeLength = InitialCodeLength;
private readonly Stream _inputStream;
public Decompressor(Stream stream) => _inputStream = stream;

public byte[] Decompress()
{
    if (_inputStream.Length < 3)
        throw new LzwDecompressorException("Input too small to even fit the required header.");

    ParseHeader();
    var dictionary = InitDict();

    using var outStream = new MemoryStream();
    var code = ReadCode();

    if (code >= 256) throw new LzwDecompressorException("The first code cannot be larger than 255!");

    outStream.Write(new[] { (byte)code }); //First code is always uncompressed

    var old = code;
    var nextIndex = _blockMode ? 257 : 256; //Skip 256 in block mode as it is a "clear code"
    while ((code = ReadCode()) != -1)
    {
        if (_blockMode && code == ClearCode)
        {
            _codeLength = InitialCodeLength;
            _maxCode = (1 << _codeLength) - 1;
            dictionary = InitDict();
            nextIndex = 257; //Block mode first index
            //Logically I should be able to read the next code here and write it out directly, as the first code after a reset is uncompressed. But since the code read is wrong, I cannot do that
            //code = ReadCode();
            //outStream.Write(new [] { (byte)code });
            //old = code;
            continue;
        }

        var word = new List<byte>();

        if (dictionary.TryGetValue(code, out var entry))
        {
            word.AddRange(entry);
        }
        else if (dictionary.Count + 1 == nextIndex)
        {
            word.AddRange(dictionary[old].ToArray().Concat(new[] { dictionary[old][0] }));
        }

        if (word.Count > 0)
        {
            outStream.Write(word.ToArray());
            dictionary[nextIndex++] = new List<byte>(dictionary[old].ToArray().Append(word[0]));
            old = code;
        }

        if (_codeLength == _maxBits) continue; //prevent code length growing beyond max

        if (nextIndex == (1 << _codeLength))
        {
            _codeLength++;
            _maxCode = (1 << _codeLength) - 1;
            _ = dictionary.EnsureCapacity(1 << _codeLength);
        }
    }

    return outStream.ToArray();
}

#region Private methods
private void ParseHeader()
{
    if (_inputStream.ReadByte() != _magicBytes[0] || _inputStream.ReadByte() != _magicBytes[1])
    {
        throw new LzwDecompressorException("The given file does not contain the LZW magic bytes");
    }

    var descriptorByte = _inputStream.ReadByte();

    _maxBits = descriptorByte & MaxCodeBitsMask;
    _blockMode = (descriptorByte & BlockModeMask) > 0;
}

private static Dictionary<int, List<byte>> InitDict()
{
    var dict = new Dictionary<int, List<byte>>(1 << InitialCodeLength); //2⁹ max entries
    for (var i = 0; i < 256; i++) dict[i] = new List<byte> { (byte)i };

    return dict;
}

private int ReadCode()
{
    var code = 0x0;
    for (var i = 0; i < _codeLength; i++)
    {
        var bit = ReadBit();
        if (bit == -1) return -1;

        code |= bit << i;
    }
    return code;
}
#region Bit Reader
private int _currentBitMask = 0x100;
private int _currentByte;
private int ReadBit()
{
    if (_currentBitMask == 0x100)
    {
        _currentBitMask = 0x1;
        var newByte = _inputStream.ReadByte();
        if (newByte == -1) return -1;

        _currentByte = newByte;
    }
    var bit = (_currentByte & _currentBitMask) > 0 ? 1 : 0;
    _currentBitMask <<= 1;
    return bit;
}
#endregion
#endregion

}

public class LzwDecompressorException : Exception
{
    public LzwDecompressorException() { }
    public LzwDecompressorException(string message) : base($"LZW Decompressor: {message}") { }
    public LzwDecompressorException(string message, Exception inner) : base($"LZW Decompressor: {message}", inner) { }
}

I recognize that there may be other things missing still, and performance issues and improvements are bound to be found. I have not paid much attention to these yet, as I am first and foremost looking to get it working before I start swapping in more performant data structures.

Upvotes: 0

Views: 215

Answers (1)

Matheos

Reputation: 529

Update

I managed to get it working. I finally found an example suited for my use case (decompressing a Unix COMPRESS'ed file): this Python code on GitHub: unlzw

The only thing I was missing was the following byte position calculation after resetting:

# process clear code (256)
if (code == 256) and flags:
    # Flush unused input bits and bytes to next 8*bits bit boundary
    rem = (nxt - mark) % bits
    if rem:
        rem = bits - rem
        if rem > inlen - nxt:
            break
        nxt += rem

I had already refactored my original code to work with a byte array instead of the memory stream I had originally. This simplified my thinking somewhat, as I was now keeping track of the "current byte position" instead of the current byte. That makes it easy, on a clear code, to recalculate the new byte position from which to continue reading. According to a comment in the code, this Python snippet is based on an old machine instruction:

Flush unused input bits and bytes to next 8*bits bit boundary (this is a vestigial aspect of the compressed data format derived from an implementation that made use of a special VAX machine instruction!)

Having implemented my own version of this calculation, based on the parameters I already had at hand, I got it working! I must say I do not fully understand the logic behind it, but I am happy it works :)
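To spell out the snippet above: nxt is the current input byte position, mark is the position just after the last reset (or the header), bits is the current code width, and inlen is the input length. The flush rounds nxt up to the next multiple of bits bytes past mark, because compress emits codes in groups of 8*bits bits (i.e. bits bytes) and pads the last group before a clear code. A minimal Python sketch of the same rounding, with made-up numbers (the function name is mine):

```python
def flush_to_boundary(position, mark, code_length):
    """Round the byte position up to the next multiple of code_length
    bytes past mark, skipping the padding written before a clear code."""
    rem = (position - mark) % code_length
    if rem:
        position += code_length - rem
    return position

# With 9-bit codes, codes come in 9-byte groups (8 codes = 72 bits).
# If a clear code lands 30 bytes past the last mark, 6 padding bytes follow:
print(flush_to_boundary(30, 0, 9))  # 36
```

A position already on a group boundary is left unchanged, which is why decompressors that never hit a clear code mid-group work fine and only fail on "real" files.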

Upvotes: 1
