Radoslaw Jurewicz

Reputation: 137

How to do multithreaded compression / decompression with GZipStream without intermediate files on very large input

I was thinking about the following approach and would like to verify whether it looks good on paper (a rough sketch follows the list):

  1. Read from the source file and split it into constant-sized chunks in memory.

  2. Keep track of the number of threads, since we have limited memory.

  3. Each chunk is compressed in memory by a separate thread.

  4. These compressed chunks are pushed into a queue in the proper order.

  5. There is one thread that reads from the queue and concatenates the chunks into the output file.

  6. Also store somewhere some metadata about the compressed chunks, to be put later into a header; I would like to use this for decompression.
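
A rough sketch of what I have in mind for the compression side (the `ChunkInfo` layout, the chunk size, and the batch-at-a-time parallelism are only placeholders; a real implementation would probably use a proper producer/consumer queue with a dedicated writer thread):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

// One metadata entry per compressed chunk (the field layout is just a placeholder).
public record ChunkInfo(long Offset, int CompressedLength, int UncompressedLength);

public static class ParallelGzip
{
    // Compress 'input' into 'output' as a sequence of independent gzip members.
    // At most 'maxParallel' chunks are held in memory and compressed at once.
    public static List<ChunkInfo> Compress(Stream input, Stream output,
                                           int chunkSize = 8 * 1024 * 1024,
                                           int maxParallel = 4)
    {
        var index = new List<ChunkInfo>();
        long offset = 0;

        while (true)
        {
            // 1) Read up to 'maxParallel' fixed-size chunks (this bounds memory use).
            var batch = new List<byte[]>(maxParallel);
            for (int i = 0; i < maxParallel; i++)
            {
                var buffer = new byte[chunkSize];
                int read = ReadFully(input, buffer);
                if (read == 0) break;
                if (read < chunkSize) Array.Resize(ref buffer, read);
                batch.Add(buffer);
            }
            if (batch.Count == 0) break;

            // 2) Compress the chunks of this batch in parallel, each into its
            //    own gzip member held in memory.
            var tasks = new Task<byte[]>[batch.Count];
            for (int i = 0; i < batch.Count; i++)
            {
                byte[] chunk = batch[i];
                tasks[i] = Task.Run(() =>
                {
                    using var ms = new MemoryStream();
                    using (var gz = new GZipStream(ms, CompressionLevel.Optimal, leaveOpen: true))
                        gz.Write(chunk, 0, chunk.Length);
                    return ms.ToArray();
                });
            }
            Task.WaitAll(tasks);

            // 3) Append the compressed members to the output in the original
            //    order and record where each member starts.
            for (int i = 0; i < tasks.Length; i++)
            {
                byte[] member = tasks[i].Result;
                index.Add(new ChunkInfo(offset, member.Length, batch[i].Length));
                output.Write(member, 0, member.Length);
                offset += member.Length;
            }
        }
        return index; // persist this (header or side file) for decompression
    }

    // Read until the buffer is full or the stream ends.
    private static int ReadFully(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```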

Having done the above, my idea for multithreaded decompression would then be as follows (again sketched after the list):

  1. Read the metadata file describing the concatenated chunks.

  2. Read the data from the compressed file in chunks that are defined by the metadata.

  3. Each chunk is decompressed in memory by a separate thread.

  4. These decompressed chunks are added to the queue in the proper order.

  5. There is a thread that concatenates the decompressed chunks into a single output file.
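
And a matching sketch of the decompression side, reusing the hypothetical `ChunkInfo` record from the sketch above (again a batch-at-a-time simplification of the queue-based design):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class ParallelGunzip
{
    // Decompress the gzip members described by 'index' (in original order)
    // from 'compressedPath' into 'output'.
    public static void Decompress(string compressedPath, IReadOnlyList<ChunkInfo> index,
                                  Stream output, int maxParallel = 4)
    {
        for (int start = 0; start < index.Count; start += maxParallel)
        {
            int count = Math.Min(maxParallel, index.Count - start);
            var tasks = new Task<byte[]>[count];

            // Decompress one batch of members in parallel; each task opens its
            // own FileStream so the threads do not share a stream position.
            for (int i = 0; i < count; i++)
            {
                ChunkInfo info = index[start + i];
                tasks[i] = Task.Run(() =>
                {
                    var compressed = new byte[info.CompressedLength];
                    using (var fs = File.OpenRead(compressedPath))
                    {
                        fs.Seek(info.Offset, SeekOrigin.Begin);
                        FillBuffer(fs, compressed);
                    }

                    var result = new byte[info.UncompressedLength];
                    using var gz = new GZipStream(new MemoryStream(compressed),
                                                  CompressionMode.Decompress);
                    FillBuffer(gz, result);
                    return result;
                });
            }

            // Write the decompressed chunks back out in the original order.
            foreach (var task in tasks)
            {
                byte[] data = task.Result;
                output.Write(data, 0, data.Length);
            }
        }
    }

    // Fill the buffer completely or throw if the stream ends early.
    private static void FillBuffer(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) throw new EndOfStreamException();
            total += n;
        }
    }
}
```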

Does the above seem plausible?

Upvotes: 1

Views: 5931

Answers (3)

Mark Adler

Reputation: 112384

Sure, that will work fine. As it happens, a concatenation of valid gzip files is also a valid gzip file. Each distinct decompressible stream is called a gzip member. Your metadata just needs the offset in the file for the start of each stream.
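
For illustration, such a side file of member offsets could be as simple as the sketch below (the file layout is only an assumption; nothing about it is gzip-specific):

```csharp
using System.Collections.Generic;
using System.IO;

public static class MemberIndex
{
    // Persist the member start offsets: a count followed by one 64-bit
    // offset per gzip member (placeholder layout).
    public static void Save(string path, IReadOnlyList<long> offsets)
    {
        using var writer = new BinaryWriter(File.Create(path));
        writer.Write(offsets.Count);
        foreach (long offset in offsets) writer.Write(offset);
    }

    public static List<long> Load(string path)
    {
        using var reader = new BinaryReader(File.OpenRead(path));
        int count = reader.ReadInt32();
        var offsets = new List<long>(count);
        for (int i = 0; i < count; i++) offsets.Add(reader.ReadInt64());
        return offsets;
    }
}
```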

If you want to store that metadata in the extra field of a gzip header, note that the extra field is limited to 64K bytes, so this may limit how small a chunk can be, e.g. on the order of tens to a hundred megabytes. For another reason, I would recommend that your chunks of data to compress be at least several megabytes each anyway -- in order to avoid a reduction in compression effectiveness.

A downside of concatenation is that you get no overall check on the integrity of the input. For example, if you mess up the order of the members somehow, this will not be detected on decompression, since each member's integrity check will pass regardless of the order. So you may want to include an overall check for the uncompressed data. An example would be the CRC of the entire uncompressed data, which can be computed from the CRCs of the members using zlib's crc32_combine().
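
For illustration, here is a rough C# port of the GF(2)-matrix technique that zlib's crc32_combine() uses (the port itself is only a sketch; calling zlib's own crc32_combine() through a native binding would work equally well). Combine(crc1, crc2, len2) returns the CRC-32 of the concatenation, given the CRC of the first piece, the CRC of the second piece, and the byte length of the second piece:

```csharp
public static class Crc32Combine
{
    // Multiply a 32x32 GF(2) matrix (array of column vectors) by a vector.
    private static uint Gf2MatrixTimes(uint[] mat, uint vec)
    {
        uint sum = 0;
        int i = 0;
        while (vec != 0)
        {
            if ((vec & 1) != 0) sum ^= mat[i];
            vec >>= 1;
            i++;
        }
        return sum;
    }

    // Square a 32x32 GF(2) matrix: square = mat * mat.
    private static void Gf2MatrixSquare(uint[] square, uint[] mat)
    {
        for (int i = 0; i < 32; i++)
            square[i] = Gf2MatrixTimes(mat, mat[i]);
    }

    // Combine crc1 (over the first block) and crc2 (over the second block,
    // len2 bytes long) into the CRC-32 of the concatenated data.
    public static uint Combine(uint crc1, uint crc2, long len2)
    {
        if (len2 <= 0) return crc1;

        var even = new uint[32]; // operator for an even power-of-two count of zero bits
        var odd = new uint[32];  // operator for an odd power-of-two count of zero bits

        // Operator for a single zero bit.
        odd[0] = 0xEDB88320u;    // reflected CRC-32 polynomial (the one gzip uses)
        uint row = 1;
        for (int n = 1; n < 32; n++)
        {
            odd[n] = row;
            row <<= 1;
        }

        Gf2MatrixSquare(even, odd); // operator for two zero bits
        Gf2MatrixSquare(odd, even); // operator for four zero bits

        // Apply len2 zero bytes to crc1 (the first squaring below yields the
        // operator for one zero byte, i.e. eight zero bits).
        do
        {
            Gf2MatrixSquare(even, odd);
            if ((len2 & 1) != 0) crc1 = Gf2MatrixTimes(even, crc1);
            len2 >>= 1;
            if (len2 == 0) break;

            Gf2MatrixSquare(odd, even);
            if ((len2 & 1) != 0) crc1 = Gf2MatrixTimes(odd, crc1);
            len2 >>= 1;
        } while (len2 != 0);

        return crc1 ^ crc2;
    }
}
```

Each gzip member's trailer already contains the CRC-32 of that member's uncompressed data, so the per-member CRCs are available without any extra computation.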

I would be interested to know whether, in your case, you get a significant speedup from parallel decompression. Decompression is usually fast enough that it is I/O bound on the mass storage device being read from.

Upvotes: 1

bommelding

Reputation: 3037

Yes, when you treat every chunk as an independent item (each gets its own GZip stream), this should work. But it adds some overhead, so your overall compression ratio will be a bit lower.

For each chunk you would need the size and the sequence number to deserialize and resequence.
The receiver would have to resequence anyway, so you could skip that step on the sender.
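
As an illustration (the field layout is only a placeholder, not something prescribed here), the framing could be as small as a sequence number and a length written in front of each compressed chunk:

```csharp
using System.IO;
using System.Text;

public static class ChunkFraming
{
    // Prefix one compressed chunk with its sequence number and its size.
    public static void WriteFrame(Stream output, int sequence, byte[] compressed)
    {
        using var writer = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true);
        writer.Write(sequence);
        writer.Write(compressed.Length);
        writer.Write(compressed);
    }

    // Read one framed chunk back; the receiver resequences by Sequence.
    public static (int Sequence, byte[] Compressed) ReadFrame(Stream input)
    {
        using var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true);
        int sequence = reader.ReadInt32();
        int length = reader.ReadInt32();
        return (sequence, reader.ReadBytes(length));
    }
}
```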

But it's hard to estimate how much you would gain from this; compression is somewhat CPU-intensive, but still much faster than most I/O devices.

Upvotes: 1

bommelding

Reputation: 3037

I don't think that GZip can be broken up this way. The whole stream depends on some token dictionary (a Huffman tree or a variation) at the start. As a hint, GZipStream.CanSeek always returns false.

So your point 3 would fail: the chunks are not independent.

What might work is to process 2 or even 3 files in parallel, depending on your I/O hardware. This is more suited to a fast SSD than to an older HDD; network I/O usually behaves like a slow HDD.
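
A minimal sketch of that per-file approach (the file paths and the parallelism limit are whatever the caller supplies; each file is compressed whole, as an ordinary .gz, with no chunking):

```csharp
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class PerFileGzip
{
    // Compress several files concurrently, each as one ordinary gzip file.
    // maxParallel would be the 2 or 3 mentioned above.
    public static void CompressFiles(string[] inputPaths, int maxParallel)
    {
        Parallel.ForEach(inputPaths,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallel },
            path =>
            {
                using var source = File.OpenRead(path);
                using var target = File.Create(path + ".gz");
                using var gzip = new GZipStream(target, CompressionLevel.Optimal);
                source.CopyTo(gzip);
            });
    }
}
```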

Upvotes: 1
