Tobi

Reputation: 694

Performance issues while creating file checksums

I am writing a console application which iterates through a binary tree and searches for new or changed files based on their MD5 checksums. The whole process is acceptably fast (14 s for ~70,000 files), but generating the checksums takes about 5 min, which is far too slow...

Any suggestions for improving this process? My hash function is the following:

private string getMD5(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    {
        if (File.Exists(filename))
        {
            try
            {
                var buffer = md5.ComputeHash(File.ReadAllBytes(filename));
                var sb = new StringBuilder();
                for (var i = 0; i < buffer.Length; i++)
                {
                    sb.Append(buffer[i].ToString("x2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                Program.logger.log("Error while creating checksum!", Program.logger.LOG_ERROR);
                return "";
            }
        }
        else
        {
            return "";
        }
    }
}

Upvotes: 0

Views: 1635

Answers (2)

arbiter

Reputation: 9575

Well, the accepted answer is not valid, because there are, of course, ways to improve your code's performance. (It does make some valid points otherwise, though.)

The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:

  • Do not read the entire file into memory for the calculation; that is slow, and it produces a lot of memory pressure via LOH objects. Instead, open the file as a stream and calculate the hash in chunks.
  • The reason you see a slowdown when using the ComputeHash stream override is that internally it uses a very small buffer (4 KB), so choose an appropriate buffer size (256 KB or more; the optimal value is to be found by experimenting).
  • Use the TransformBlock and TransformFinalBlock functions to calculate the hash value; you can pass null for the outputBuffer parameter (see the sketch after this list).
  • Reuse that buffer for the following files' hash calculations, so there is no need for additional allocations.
  • Additionally, you can reuse the MD5CryptoServiceProvider instance, but the benefits are questionable.
  • And lastly, you can apply an async pattern to reading chunks from the stream, so the OS will read the next chunk from disk while you are calculating the partial hash of the previous chunk. Such code is of course more difficult to write, and you will need at least two buffers (reuse them as well), but it can have a great impact on speed.
  • As a minor improvement, do not check for file existence. I believe your function is called from some enumeration, and there is very little chance that a file is deleted in the meantime.
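To make the chunked approach concrete, here is a minimal sketch covering the first four points (the helper name GetMD5Chunked is made up for illustration; it is not the OP's code):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hashes one file in chunks; both the MD5 instance and the buffer
// are passed in so they can be reused across many files.
static string GetMD5Chunked(string filename, MD5 md5, byte[] buffer)
{
    md5.Initialize(); // reset state so the same instance can be reused

    using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, buffer.Length))
    {
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Feed each chunk into the hash; outputBuffer may be null.
            md5.TransformBlock(buffer, 0, bytesRead, null, 0);
        }
        md5.TransformFinalBlock(buffer, 0, 0); // finalize with an empty block
    }

    var sb = new StringBuilder(32);
    foreach (var b in md5.Hash)
    {
        sb.Append(b.ToString("x2"));
    }
    return sb.ToString();
}

Typical usage, with one buffer and one MD5 instance for the whole scan (the files enumeration is assumed to come from wherever the tree walk happens):

var buffer = new byte[256 * 1024]; // 256 KB, reused for every file
using (var md5 = MD5.Create())
{
    foreach (var file in files)
    {
        var hash = GetMD5Chunked(file, md5, buffer);
        // compare/store hash...
    }
}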

All of the above is valid for medium to large files. If you instead have a lot of very small files, you can speed up the calculation by processing files in parallel. Actually, parallelization can also help with large files, but that has to be measured.
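For the parallel case, Parallel.ForEach with thread-local state is a natural fit, because each worker needs its own MD5 instance and buffer. A sketch, assuming the hypothetical GetMD5Chunked helper above and a files collection:

using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Threading.Tasks;

var results = new ConcurrentDictionary<string, string>();

Parallel.ForEach(
    files,
    // One MD5 instance and one buffer per worker thread, reused across files.
    () => (md5: MD5.Create(), buffer: new byte[256 * 1024]),
    (file, state, local) =>
    {
        results[file] = GetMD5Chunked(file, local.md5, local.buffer);
        return local;
    },
    local => local.md5.Dispose());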

And finally, if collisions don't bother you too much, you can choose a less expensive hash algorithm, for example CRC.
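The .NET Framework of that era has no built-in CRC32, but newer runtimes offer one through the System.IO.Hashing NuGet package. A minimal sketch, assuming that package and .NET 5+ (for Convert.ToHexString):

using System;
using System.IO;
using System.IO.Hashing; // NuGet package: System.IO.Hashing

static string GetCrc32(string filename, byte[] buffer)
{
    var crc = new Crc32();
    using (var stream = File.OpenRead(filename))
    {
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            crc.Append(buffer.AsSpan(0, bytesRead)); // non-cryptographic, much cheaper than MD5
        }
    }
    return Convert.ToHexString(crc.GetCurrentHash());
}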

Upvotes: 2

Christopher

Reputation: 9804

In order to create the hash, you have to read every last byte of the file. So this operation is disk-limited, not CPU-limited, and it scales proportionally to the size of the files. Multithreading will not help.

Unless the FS can somehow calculate and store the hash for you, there is just no way to speed this up. You are dependent on what the FS does for you to track changes.

Generally, programs that check for "changed files" (like backup routines) do not calculate the hash value, for exactly that reason. They may still calculate and store it for validation purposes, but that is it.

Unless the user does some serious (NTFS-driver-loading-level) sabotage, the "last changed" date together with the file size is enough to detect changes. Maybe also check the archive bit, but that one is rarely used nowadays.
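A minimal sketch of that size-plus-timestamp check (the FileRecord type, standing in for whatever the application stored from the previous run, is hypothetical):

using System;
using System.IO;

// Hypothetical record of what was known about the file on the previous run.
sealed class FileRecord
{
    public long Length;
    public DateTime LastWriteTimeUtc;
}

static bool HasChanged(string path, FileRecord previous)
{
    var info = new FileInfo(path);
    // Any difference in size or last-write time counts as a change;
    // only then is an expensive re-hash worth doing.
    return info.Length != previous.Length
        || info.LastWriteTimeUtc != previous.LastWriteTimeUtc;
}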

A minor improvement for this kind of scenario (list files and process them) is using EnumerateFiles rather than GetFiles. But at 14 seconds of listing versus 5 minutes of processing, that will just not have any relevant effect.
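For reference, the difference is just which enumeration API drives the loop (the root path here is a placeholder):

using System.IO;

// Directory.GetFiles materializes the full array before returning anything;
// Directory.EnumerateFiles yields paths lazily, so processing can start immediately.
foreach (var file in Directory.EnumerateFiles(@"C:\data", "*", SearchOption.AllDirectories))
{
    // process file...
}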

Upvotes: 1
