Reputation: 15981
I'm using the Microsoft.Azure.Storage.DataMovement NuGet package to transfer multiple very large (150 GB) files into Azure cold storage with TransferManager.UploadDirectoryAsync.
It works very well, but a choke point in my process is that after the upload I attach to the FileTransferred
event and read the local file all over again to calculate its MD5 checksum and compare it to the remote copy:
private void FileTransferredCallback(object sender, TransferEventArgs e)
{
    var sourceFile = e.Source.ToString();
    var destinationFile = (ICloudBlob)e.Destination;
    var localMd5 = CalculateMd5(sourceFile);
    var remoteMd5 = destinationFile.Properties.ContentMD5;
    if (localMd5 == remoteMd5)
    {
        destinationFile.Metadata.Add(Md5VerifiedKey, DateTimeOffset.UtcNow.ToDisplayText());
        destinationFile.SetMetadata();
    }
}
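CalculateMd5 is essentially just a streaming MD5 over the local file, something like this minimal sketch (assuming the usual System.IO and System.Security.Cryptography usings); the point is that it re-reads the whole 150 GB file:

private static string CalculateMd5(string filePath)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(filePath))
    {
        // Second full read of the file; ContentMD5 is the Base64-encoded digest.
        return Convert.ToBase64String(md5.ComputeHash(stream));
    }
}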
This is slower than it needs to be, since every file gets handled twice: first by the library, then by my MD5 check.
Is this check even necessary, or is the library already doing the heavy lifting for me? I can see Md5HashStream, but after a quick look through the source it isn't clear to me whether it is used to verify the entire remote file.
Upvotes: 0
Views: 860
Reputation: 6467
Note that blob.Properties.ContentMD5 for the entire blob is actually set by the Microsoft.Azure.Storage.DataMovement library, based on its own local calculation after it has uploaded all the blocks of the blob; it is not computed by the Azure Storage Blob service.
The data integrity of the upload is guaranteed by the Content-MD5 HTTP header sent with every single Put Block request, not by blob.Properties.ContentMD5 of the entire blob, since the Azure Storage Blob service doesn't actually validate that value when the DataMovement library sets it (see the description of the x-ms-blob-content-md5 HTTP header).
The main purpose of blob.Properties.ContentMD5 is to verify data integrity when downloading the blob back to local disk via the Microsoft.Azure.Storage.DataMovement library (as long as DownloadOptions.DisableContentMD5Validation is left at its default value of false).
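For example, a minimal download sketch (container, blob name and local path are placeholders, and the Microsoft.Azure.Storage.Blob / Microsoft.Azure.Storage.DataMovement usings are assumed):

CloudBlockBlob blob = container.GetBlockBlobReference("huge-file.bin");

var options = new DownloadOptions
{
    // false is the default; shown here only to make the behavior explicit.
    DisableContentMD5Validation = false
};
var context = new SingleTransferContext();

await TransferManager.DownloadAsync(blob, @"D:\restore\huge-file.bin", options, context);
// If the MD5 recomputed over the downloaded bytes does not match
// blob.Properties.ContentMD5, the transfer fails rather than completing silently.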
Upvotes: 1
Reputation: 24549
Is this check even necessary or is the library already doing the heavy lifting for me?
Based on my knowledge, we just need to check whether the blob has a value for the ContentMD5 property.
When Microsoft.Azure.Storage.DataMovement uploads a large file, the upload is actually composed of multiple Put Block requests plus one final Put Block List request. Each Put Block request carries only part of the content, so the MD5 on such a request covers only that block and cannot serve as the MD5 value of the final blob.
The body of the Put Block List request is just the list of the block identities uploaded above, so the MD5 value on that request can only check the integrity of the list itself.
When all of these requests are validated, the integrity of the content is guaranteed. For performance reasons, the storage service does not re-read all of the committed blocks to compute the MD5 of the entire blob; instead it provides a special request header, x-ms-blob-content-md5, and stores whatever value the client supplies there as the blob's MD5 value. So as long as the client sets the MD5 of the entire content in x-ms-blob-content-md5 on the final Put Block List request, the per-request verification still holds and the blob also ends up with an MD5 value.
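To make that flow concrete, here is a rough sketch using the plain Microsoft.Azure.Storage.Blob SDK rather than DataMovement. UploadInBlocksAsync and its parameters are only illustrative, block splitting, parallelism and error handling are omitted, and the usual System.Security.Cryptography usings are assumed:

static async Task UploadInBlocksAsync(CloudBlockBlob blob, IEnumerable<byte[]> blocks)
{
    var blockIds = new List<string>();
    int index = 0;

    using (var wholeBlobMd5 = MD5.Create())   // incremental MD5 over the whole content
    using (var blockMd5 = MD5.Create())       // reused for each block
    {
        foreach (var blockBytes in blocks)
        {
            string blockId = Convert.ToBase64String(BitConverter.GetBytes(index++));
            string blockHash = Convert.ToBase64String(blockMd5.ComputeHash(blockBytes));

            // Content-MD5 on Put Block: the service rejects this block if the
            // bytes were corrupted in transit.
            await blob.PutBlockAsync(blockId, new MemoryStream(blockBytes), blockHash);

            wholeBlobMd5.TransformBlock(blockBytes, 0, blockBytes.Length, null, 0);
            blockIds.Add(blockId);
        }
        wholeBlobMd5.TransformFinalBlock(new byte[0], 0, 0);

        // Sent as x-ms-blob-content-md5 on Put Block List: stored by the service
        // as the blob's ContentMD5, but not re-validated against the blocks.
        blob.Properties.ContentMD5 = Convert.ToBase64String(wholeBlobMd5.Hash);
        await blob.PutBlockListAsync(blockIds);
    }
}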
In summary, for block uploads the per-request Content-MD5 headers take care of integrity on the wire, and whether the finished blob carries an MD5 value depends on whether x-ms-blob-content-md5 was set on the Put Block List request.
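For example, a quick way to check after the upload (blob is any CloudBlockBlob/ICloudBlob reference):

// Fetch the properties and see whether ContentMD5 was stored.
await blob.FetchAttributesAsync();

if (string.IsNullOrEmpty(blob.Properties.ContentMD5))
{
    // x-ms-blob-content-md5 was never set on Put Block List,
    // so there is nothing to validate against on later downloads.
    Console.WriteLine("Blob has no ContentMD5 value.");
}
else
{
    Console.WriteLine($"Blob ContentMD5: {blob.Properties.ContentMD5}");
}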
Upvotes: 0