Reputation: 1155
Let's say I want to compress a file of 10GB. How can a program like winrar search for patterns in this vast data ?
I'm sure it won't search for each pattern individually on the hard disk, this would take forever. But it also can't buffer the entire dataset in my ram + buffering all possible patterns. this would take x-times the data amount of the original file. But splitting the file into smaller pieces also creates all sorts of problems, and you would still have to store all possible patterns in the ram, which would take more space than the file itself.
Upvotes: 0
Views: 138
Reputation: 50087
Tools like WinRAR use data compression algorithms of the LZ, PPM, or BZIP family. Those have a relatively small window size (32 KB to about 8 MB), as Mark Adler wrote.
There is a class of data compression algorithms that look at all previous data to search for repeated blocks: this is called data deduplication. Here, the typical block sizes are much larger (for example 4 KB, or a variable size block around 4 KB, or even whole files, instead of just a few bytes as this is the case for the LZ family). Hashes are used to identify blocks; sometimes a cryptographic hash such as SHA1, sometimes a non-cryptographic hash. For cryptographic hashes, there is no need to verify it was really a match, as the probability of a hash collision is so low. In this, the data itself doesn't need to be kept in RAM, just the hashes. And for non-cryptographic hashed, only the hash and the position where the data was found needs to be kept in memory. This works fine because the blocks are larger.
In the future, it is very likely that regular data compression tools combine both techniques to compress data even better, specially for large archives.
Upvotes: 0
Reputation: 112284
They don't. The usual LZ77 approaches define a sliding window preceding the data being compressed in which the compressor searches for matching strings. The window size might be from 32K for deflate to 8 MB (default) for LZMA. Burrows-Wheeler transform approaches, e.g. bzip2, have specified block sizes (900K default for bzip2).
Upvotes: 1