Konstantin Tugeev
Konstantin Tugeev

Reputation: 39

LZ77 optimization

I have a code for LZ77 compression algorithm. It works fine with small files. But if I want to compress 100 kB and bigger files, it takes a lot of time.

I think, it's all because of this part:

do { // searchin for longest mathcing
                j++;
                i++;
                if(!lookBuffer[i]) fread(&lookBuffer[i], 1, 1, from);

                if (j == strlen(searchBuffer) && lookBuffer[addJ] == lookBuffer[i]) {
                    do {
                        i++;
                        j++;
                        if(!lookBuffer[i]) fread(&lookBuffer[i], 1, 1, from);
                        addJ++;
                    } while (lookBuffer[addJ] == lookBuffer[i] && i < lookLen);
                }
            } while (searchBuffer[j] == lookBuffer[i] && i < lookLen);

This part searches the longest matching between search buffer and look-ahead buffer.

So, is there faster way to find this matching?

Full code for LZ77 compression:

 void lz77(char *fileName, FILE *from, FILE *path, long long size) {

char *searchBuffer = (char*)malloc((searchLen) * sizeof(char)); // search array
memset(searchBuffer, 0x00, searchLen);

long i = 0, j = 0, length = 0, offset = 0, isMatching = 0, iForSearchBuff = 0, iForLookBuff = 0;
int isFirst = 1, n = 0;
unsigned char symbol;

while(!feof(from)) {
    isMatching = 0;
    length = 0;
    iForSearchBuff = 0;
    iForLookBuff = 0;
    offset = strlen(searchBuffer);
    if (isFirst) { // first symbol
        fread(searchBuffer, 1, 1, from); 
        writeBit(0, 0, *searchBuffer, tmpFile); // writing(0, 0, symbol)
        isFirst = 0;
    } else {
        char *lookBuffer = (char*)malloc((lookLen) * sizeof(char)); // look array
        memset(lookBuffer, 0x00, lookLen);

        fread(lookBuffer, 1, 1, from);

        for (j = 0; j <= strlen(searchBuffer); j++) {

            i = 0;
            if (searchBuffer[j] == lookBuffer[i]) { // MATCHING
                isMatching = 1;
                long fixJ = j, fixI = i, addJ = 0;

                do { // searchin for longest mathcing
                    j++;
                    i++;
                    if(!lookBuffer[i]) fread(&lookBuffer[i], 1, 1, from);

                    if (j == strlen(searchBuffer) && lookBuffer[addJ] == lookBuffer[i]) {
                        do {
                            i++;
                            j++;
                            if(!lookBuffer[i]) fread(&lookBuffer[i], 1, 1, from);
                            addJ++;
                        } while (lookBuffer[addJ] == lookBuffer[i] && i < lookLen);
                    }
                } while (searchBuffer[j] == lookBuffer[i] && i < lookLen);

                if (((strlen(searchBuffer) - fixJ) <= offset) && ((i - fixI) >= length)) { 
                    length = i - fixI;
                    offset = strlen(searchBuffer) - fixJ;
                    iForSearchBuff = fixI;
                    iForLookBuff = i;
                }

                j = fixJ;

            } else if (j >= (strlen(searchBuffer))) {
                if (isMatching) {
                    if (lookBuffer[iForLookBuff] == '\0') { writeBit(offset, length, ' ', tmpFile);}
                    else {
                        writeBit(offset, length, lookBuffer[iForLookBuff], tmpFile);
                        searchBuffer = insertIntoBuffer(searchBuffer, length + 1, &lookBuffer[iForSearchBuff]);
                    } 

                    isMatching = 0;
                    break;
                } else {
                    writeBit(0, 0, lookBuffer[i], tmpFile);

                    searchBuffer = insertIntoBuffer(searchBuffer, 1, &lookBuffer[i]);
                    break;
                }
            }

        }
    }
}

}

Upvotes: 0

Views: 886

Answers (3)

Mark Adler
Mark Adler

Reputation: 112189

zlib's deflate keeps a hash table using the first three bytes at each location as the input to the hash function. For hash collisions a list is maintained for each hash value, where the length of the list is limited depending on the compression level. The hash value is computed for the first three bytes at the current position to be compressed. For each preceding position with the same hash value and within the allowed distance, the data at that position is then directly compared to find the length of the match. Then the longest match found is emitted, or the literal if there is no match of three or more bytes.

Upvotes: 3

Brecht Sanders
Brecht Sanders

Reputation: 7287

I'm not an expert, but clever people have devised efficient search algorithms that may apply here. For example check out the Knuth–Morris–Pratt algorithm https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Upvotes: 1

Chris Hall
Chris Hall

Reputation: 1772

Keep a table indexed by the first 'n' bytes of each string in your window -- where 'n' is the minimum length match you want (probably 2). You may have multiple strings starting with the same 'n' bytes, so the entries in this table are lists of strings and each one's offset in the window. You may want to prefer longest match at smallest offset from the end of the window. It can be worth dealing specially with strings that start with a long run of the same byte. As you move the window forwards, you need to drop strings from the table and add new ones.

Upvotes: 1

Related Questions