pixartist
pixartist

Reputation: 1155

How do compression algorithms search for patterns in vast data blobs?

Let's say I want to compress a file of 10GB. How can a program like winrar search for patterns in this vast data ?

I'm sure it won't search for each pattern individually on the hard disk, this would take forever. But it also can't buffer the entire dataset in my ram + buffering all possible patterns. this would take x-times the data amount of the original file. But splitting the file into smaller pieces also creates all sorts of problems, and you would still have to store all possible patterns in the ram, which would take more space than the file itself.

Upvotes: 0

Views: 138

Answers (2)

Thomas Mueller
Thomas Mueller

Reputation: 50087

Tools like WinRAR use data compression algorithms of the LZ, PPM, or BZIP family. Those have a relatively small window size (32 KB to about 8 MB), as Mark Adler wrote.

There is a class of data compression algorithms that look at all previous data to search for repeated blocks: this is called data deduplication. Here, the typical block sizes are much larger (for example 4 KB, or a variable size block around 4 KB, or even whole files, instead of just a few bytes as this is the case for the LZ family). Hashes are used to identify blocks; sometimes a cryptographic hash such as SHA1, sometimes a non-cryptographic hash. For cryptographic hashes, there is no need to verify it was really a match, as the probability of a hash collision is so low. In this, the data itself doesn't need to be kept in RAM, just the hashes. And for non-cryptographic hashed, only the hash and the position where the data was found needs to be kept in memory. This works fine because the blocks are larger.

In the future, it is very likely that regular data compression tools combine both techniques to compress data even better, specially for large archives.

Upvotes: 0

Mark Adler
Mark Adler

Reputation: 112284

They don't. The usual LZ77 approaches define a sliding window preceding the data being compressed in which the compressor searches for matching strings. The window size might be from 32K for deflate to 8 MB (default) for LZMA. Burrows-Wheeler transform approaches, e.g. bzip2, have specified block sizes (900K default for bzip2).

Upvotes: 1

Related Questions