Reputation: 4444
I need to calculate checksums of quite large files (gigabytes). This can be accomplished using the following method:
private byte[] calcHash(string file)
{
// Dispose the hash algorithm and the stream even if ComputeHash throws.
using (System.Security.Cryptography.HashAlgorithm ha = System.Security.Cryptography.MD5.Create())
using (FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
return ha.ComputeHash(fs);
}
}
However, the files are normally written just beforehand in a buffered manner (say, 32 MB at a time). I am convinced that I once saw an overload of a hash function that let me calculate an MD5 (or other) hash while writing, i.e. calculating the hash of one buffer and then feeding the resulting hash into the next iteration.
Something like this: (pseudocode-ish)
byte [] hash = new byte [] { 0,0,0,0,0,0,0,0 };
while(!eof)
{
buffer = readFromSourceFile();
writefile(buffer);
hash = calchash(buffer, hash);
}
hash is now equivalent to what would be produced by running the calcHash function on the entire file.
Now, I can't find any overloads like that in the .NET 3.5 Framework. Am I dreaming? Has it never existed, or am I just lousy at searching? I want to write the file and calculate the checksum in a single pass because the files are so large.
Upvotes: 35
Views: 19973
Reputation: 18305
I've just had to do something similar, but wanted to read the file asynchronously. It's using TransformBlock and TransformFinalBlock and is giving me answers consistent with Azure, so I think it is correct!
private static async Task<string> CalculateMD5Async(string fullFileName)
{
var block = ArrayPool<byte>.Shared.Rent(8192);
try
{
using (var md5 = MD5.Create())
{
using (var stream = new FileStream(fullFileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192, true))
{
int length;
while ((length = await stream.ReadAsync(block, 0, block.Length).ConfigureAwait(false)) > 0)
{
md5.TransformBlock(block, 0, length, null, 0);
}
md5.TransformFinalBlock(block, 0, 0);
}
var hash = md5.Hash;
return Convert.ToBase64String(hash);
}
}
finally
{
ArrayPool<byte>.Shared.Return(block);
}
}
Upvotes: 3
Reputation: 57976
It seems you can use TransformBlock / TransformFinalBlock, as shown in this sample: Displaying progress updates when hashing large files
Upvotes: 5
Reputation: 1457
I like the answer above, but for the sake of completeness, and as a more general solution, refer to the CryptoStream class. If you are already handling streams, it is easy to wrap your stream in a CryptoStream, passing a HashAlgorithm as the ICryptoTransform parameter.
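// FileMode.Create (rather than Open) so this also works when the file does not exist yet.
using (var file = new FileStream("foo.txt", FileMode.Create, FileAccess.Write))
using (var md5 = MD5.Create())
using (var cs = new CryptoStream(file, md5, CryptoStreamMode.Write))
{
while (notDoneYet)
{
byte[] buffer = Get32MB();
cs.Write(buffer, 0, buffer.Length);
}
// Finalize the hash; md5.Hash is not valid until the final block has been flushed.
cs.FlushFinalBlock();
System.Console.WriteLine(BitConverter.ToString(md5.Hash));
}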
You have to close the stream (or call FlushFinalBlock on it) before getting the hash, so the HashAlgorithm knows it's done.
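The same wrapping idea also works on the reading side. Below is a minimal sketch, assuming you would rather hash the bytes as they are read from the source than as they are written; the class and method names (CryptoStreamCopy, CopyAndHash) are made up for illustration. Reading until Read returns 0 is what lets the CryptoStream finalize the hash.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class CryptoStreamCopy
{
    // Hypothetical helper: copies source to destination through a read-mode
    // CryptoStream and returns the MD5 of everything that was read.
    public static byte[] CopyAndHash(Stream source, Stream destination, int bufferSize)
    {
        using (MD5 md5 = MD5.Create())
        using (CryptoStream hashing = new CryptoStream(source, md5, CryptoStreamMode.Read))
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            // The hash transform passes the data through unchanged, so what we
            // read here is the original content of the source stream.
            while ((read = hashing.Read(buffer, 0, buffer.Length)) > 0)
            {
                destination.Write(buffer, 0, read);
            }
            // The source has been read to its end, so the hash is finalized.
            return md5.Hash;
        }
    }
}
```

Note that disposing the CryptoStream also closes the source stream, so this sketch takes ownership of it.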
Upvotes: 49
Reputation: 700572
You use the TransformBlock and TransformFinalBlock methods to process the data in chunks.
// Init
MD5 md5 = MD5.Create();
int offset = 0;
// For each block:
offset += md5.TransformBlock(block, 0, block.Length, block, 0);
// For last block:
md5.TransformFinalBlock(block, 0, block.Length);
// Get the hash code
byte[] hash = md5.Hash;
Note: It works (at least with the MD5 provider) to send all blocks to TransformBlock and then send an empty block to TransformFinalBlock to finalise the process.
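Applied to the copy loop from the question, that could look roughly like the sketch below. The names (TransformBlockCopy, CopyAndHash) and the stream-based signature are assumptions for illustration, not part of the framework; the buffer size is left to the caller (the question mentions 32 MB).

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class TransformBlockCopy
{
    // Hypothetical helper: copies source to destination and returns the MD5
    // of the copied bytes, hashing each buffer as it is written.
    public static byte[] CopyAndHash(Stream source, Stream destination, int bufferSize)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                destination.Write(buffer, 0, read);
                // Feed only the bytes actually read into the running hash.
                // Passing null as the output buffer skips the pass-through copy.
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            // An empty final block finalizes the hash, as described in the note above.
            md5.TransformFinalBlock(buffer, 0, 0);
            return md5.Hash;
        }
    }
}
```

The result should match what ComputeHash returns for the finished file, so the file never has to be read a second time.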
Upvotes: 53
Reputation: 48310
Hash algorithms are expected to handle this situation and are typically implemented with 3 functions:
hash_init() - Called to allocate resources and begin the hash.
hash_update() - Called with new data as it arrives.
hash_final() - Called to complete the calculation and free resources.
Look at http://www.openssl.org/docs/crypto/md5.html or http://www.openssl.org/docs/crypto/sha.html for good, standard examples in C; I'm sure there are similar libraries for your platform.
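For what it's worth, newer .NET versions (not the 3.5 Framework from the question) have an IncrementalHash class that maps fairly directly onto this init/update/final shape. A minimal sketch, with a made-up helper name (IncrementalHashDemo.HashParts) and string inputs chosen only to keep the example short:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class IncrementalHashDemo
{
    // Hypothetical helper: hashes the concatenation of the given parts
    // without ever building the concatenated data in memory.
    public static string HashParts(params string[] parts)
    {
        // init: allocate the hash state
        using (IncrementalHash hasher = IncrementalHash.CreateHash(HashAlgorithmName.MD5))
        {
            foreach (string part in parts)
            {
                // update: feed data as it arrives
                hasher.AppendData(Encoding.UTF8.GetBytes(part));
            }
            // final: complete the calculation and get the digest
            return BitConverter.ToString(hasher.GetHashAndReset());
        }
    }
}
```

Calling HashParts("hello ", "world") should give the same digest as hashing "hello world" in one go.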
Upvotes: 3