user18024637

Reputation: 43

Zip file - Parallelism

I want to zip all the files in a folder using C#. I am not sure whether it's a good idea to use a parallel loop to add files to a zip archive, something like the code below. I need to handle almost 20k files of about 2 MB each. Looking for a suggestion.

    using (var archive = ZipFile.Open(zf, ZipArchiveMode.Create))
    {
        Parallel.ForEach(files, po, fPath =>
        {
            archive.CreateEntryFromFile(fPath, Path.GetFileName(fPath));
        });
    }

Upvotes: 2

Views: 3005

Answers (3)

mateli

Reputation: 11

Something like this is probably as good as it gets using System.IO.Compression.ZipArchive:

    // Read every file into memory in parallel (PLINQ), then add the entries
    // sequentially. `zip` is a ZipArchive opened elsewhere, and
    // `dirTargetNames` holds the file paths.
    var fileContent = dirTargetNames
        .AsParallel()
        .WithDegreeOfParallelism(100)
        .Select(s => new KeyValuePair<string, byte[]>(s, File.ReadAllBytes(s)))
        .ToImmutableDictionary();

    foreach (var p in fileContent)
    {
        // Do not remove "using": it ensures the entry is closed before
        // moving to the next iteration.
        using var f = zip.CreateEntry(p.Key).Open();
        f.Write(p.Value);
    }

Otherwise, look into pigz, which as far as I know exists only as a command-line tool. It does everything in parallel: reading, compressing, and writing. As far as I know it is the only tool that can do fully parallel zip compression.
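If you go that route, one way to drive it from C# is to shell out to tar piped through pigz. This is a hypothetical sketch, not pigz's API: it assumes a Unix-like shell with tar and pigz on PATH, the paths and thread count (-p 8) are illustrative, and the output is a .tar.gz rather than a .zip:

    using System.Diagnostics;

    // Stream a tar of the folder through pigz, which compresses the stream
    // with multiple threads. tar/pigz availability and all paths are assumptions.
    var psi = new ProcessStartInfo
    {
        FileName = "/bin/sh",
        Arguments = "-c \"tar -cf - -C /data/input . | pigz -p 8 > /data/output.tar.gz\"",
        UseShellExecute = false
    };
    using (var proc = Process.Start(psi))
    {
        proc.WaitForExit();
    }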

However, in most cases reading many small files will be the bottleneck. You may want to adjust the PLINQ code above to read only files under a certain size into memory and handle the rest with CreateEntryFromFile. Note also that this code doesn't preserve attributes and timestamps, and it doesn't handle the obvious failure modes, like ReadAllBytes throwing an exception.

Note that you may need a large number of threads to speed things up when Windows is busy figuring out whether you have the rights to open each file!

Upvotes: 1

JonasH

Reputation: 36371

System.IO.Compression.ZipArchive is not thread-safe, so you cannot use it from a parallel loop. The common convention is that static methods should be thread-safe but instance members are not, unless otherwise noted in the documentation. So never put things in parallel loops without verifying that all the classes involved are thread-safe.
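For reference, a minimal thread-safe rewrite of the loop from the question is simply the sequential one (a sketch, reusing the question's `zf` and `files` variables):

    using System.IO;
    using System.IO.Compression;

    // ZipArchive is not thread-safe, so add the entries from a single thread.
    using (var archive = ZipFile.Open(zf, ZipArchiveMode.Create))
    {
        foreach (var fPath in files)
        {
            archive.CreateEntryFromFile(fPath, Path.GetFileName(fPath));
        }
    }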

It is perfectly possible for such an implementation to be parallelized internally, but as far as I know, this one is not. There might be other libraries available that support multithreading; you could ask at https://softwarerecs.stackexchange.com/ for such a library.

The benefit of multithreading will depend on the compression algorithm used. Some algorithms are very lightweight and will be limited by disk speed, while others will be CPU-limited. The most common algorithm is Deflate, which is fairly fast, but not as fast as something like LZ4.

If you are archiving already compressed data, like images, you should specify CompressionLevel.NoCompression to improve speed, since compression would not help anyway.
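For what it's worth, a minimal sketch of that, using the CreateEntryFromFile overload that accepts a CompressionLevel (the paths and file filter here are illustrative):

    using System.IO;
    using System.IO.Compression;

    // Store already-compressed files (e.g. JPEGs) without re-compressing them.
    using (var archive = ZipFile.Open(@"C:\out\photos.zip", ZipArchiveMode.Create))
    {
        foreach (var path in Directory.EnumerateFiles(@"C:\photos", "*.jpg"))
        {
            archive.CreateEntryFromFile(path, Path.GetFileName(path),
                CompressionLevel.NoCompression);
        }
    }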

You might also split the work in some other way, e.g. compress different folders concurrently, or create multiple archives of, say, 1k files each that can be built in parallel.
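A sketch of the multiple-archives idea, assuming .NET 6+ for Enumerable.Chunk (the paths and chunk size are illustrative). Each archive is written by exactly one thread, so no ZipArchive instance is ever shared:

    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Threading.Tasks;

    var files = Directory.GetFiles(@"C:\data");

    // Build one archive per 1,000 files; the archives are independent, so the
    // chunks can be compressed in parallel without sharing a ZipArchive.
    Parallel.ForEach(files.Chunk(1000).ToArray(), (chunk, _, index) =>
    {
        using var archive = ZipFile.Open($@"C:\out\part_{index}.zip",
            ZipArchiveMode.Create);
        foreach (var path in chunk)
            archive.CreateEntryFromFile(path, Path.GetFileName(path));
    });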

Upvotes: 2

Theodor Zoulias

Reputation: 43535

This is not a hands-on/technical answer. In general, parallelizing operations that are mostly I/O-bound doesn't yield significant performance improvements, because the bottleneck is not the CPU but the capabilities of the storage device. Some hardware, such as SSDs, reacts better to parallelization than others. Classic hard disks in particular react rather poorly to parallelization, with performance usually being even worse than doing the work sequentially.

Compressing a file involves both I/O and computation, but the CPU part is pretty lightweight, so overall it's mostly an I/O-bound operation. You could attempt to implement the producer-consumer pattern, as suggested in this answer, but don't expect any spectacular performance gains from it.
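For completeness, a rough sketch of that producer-consumer shape, assuming .NET 6+ (System.Threading.Channels and Parallel.ForEachAsync); the paths are illustrative. Several producers read file bytes in parallel, while a single consumer is the only code that touches the non-thread-safe ZipArchive:

    using System.IO;
    using System.IO.Compression;
    using System.Threading.Channels;
    using System.Threading.Tasks;

    var files = Directory.GetFiles(@"C:\data");
    var channel = Channel.CreateBounded<(string Name, byte[] Bytes)>(capacity: 16);

    // Producers: read the files in parallel (the I/O-bound part).
    var producers = Task.Run(async () =>
    {
        await Parallel.ForEachAsync(files, async (path, ct) =>
        {
            var bytes = await File.ReadAllBytesAsync(path, ct);
            await channel.Writer.WriteAsync((Path.GetFileName(path), bytes), ct);
        });
        channel.Writer.Complete();
    });

    // Single consumer: the only thread that uses the ZipArchive, so its lack
    // of thread safety is not a problem. The compression itself still happens
    // here, sequentially.
    using (var archive = ZipFile.Open(@"C:\out\data.zip", ZipArchiveMode.Create))
    {
        await foreach (var (name, bytes) in channel.Reader.ReadAllAsync())
        {
            using var entry = archive.CreateEntry(name).Open();
            entry.Write(bytes);
        }
    }
    await producers;

The bounded channel caps how many file contents are buffered in memory at once, which matters with 20k files of 2 MB each.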

Upvotes: 0
