Michou Robert

Reputation: 13

Calculate MD5 from AWS S3 ETag

I know it is possible to calculate the ETag of a locally stored file. That's not useful in my case. I have a chain where I zip files and upload them directly to S3 storage, keeping everything in memory:

zip -r - $input_path | tee >(md5sum - >> $MD5_FILE) >(aws s3 cp - s3://$bucket_name/$final_path_zip) >/dev/null

After this I want to check whether the ETag matches the MD5 I calculated in that command. So I would like to know if it's possible (ideally in bash) to calculate the MD5 checksum of the whole file knowing only the ETag.

The other way around would be to calculate the ETag from the piped zip, but I have no idea how to do that (I didn't get anywhere with wc -c).

Upvotes: 1

Views: 4963

Answers (1)

Anon Coward

Reputation: 10823

You can't get the MD5 digest from an arbitrary ETag in S3. For non-encrypted objects uploaded with a single PutObject request, the ETag is just the MD5 digest of the object's contents. For objects uploaded with multipart uploads, it is documented as a composite checksum: the MD5 digest of the concatenated MD5 digests of each part, with a suffix giving the number of parts. Since the MD5 hash algorithm is not reversible, you can't recover the hashes of the individual parts from it.
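
For illustration, here is a minimal sketch of that composite calculation, assuming you already had the raw MD5 digest of each part (the part contents below are made up; real parts are the 8 MB or larger slices the uploading client actually sent):

from hashlib import md5

# Hypothetical part contents, just to show the shape of the calculation
part_digests = [md5(b"first part").digest(), md5(b"second part").digest()]

# Composite ETag: MD5 of the concatenated part digests, plus "-<part count>"
composite_etag = md5(b"".join(part_digests)).hexdigest() + "-" + str(len(part_digests))
print(composite_etag)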

For objects encrypted with SSE-C or SSE-KMS, uploaded with any method, the ETag is just documented as "not an MD5 digest of their object data".

So, if you want to compare the ETag of an object in S3 with what you create, you'll need to calculate the ETag using the same technique as S3 does. md5 on its own is not enough to do this with multipart uploads; you'll need something more complex. The following Python script will do just that, outputting either an MD5 digest for smaller files, or a composite digest of the parts for larger uploads:

#!/usr/bin/env python3

import sys
from hashlib import md5

MULTIPART_THRESHOLD = 8388608
MULTIPART_CHUNKSIZE = 8388608
BUFFER_SIZE = 1048576

# Verify some assumptions are correct
assert(MULTIPART_CHUNKSIZE >= MULTIPART_THRESHOLD)
assert((MULTIPART_THRESHOLD % BUFFER_SIZE) == 0)
assert((MULTIPART_CHUNKSIZE % BUFFER_SIZE) == 0)

hash = md5()
read = 0
chunks = None

while True:
    # Read some from stdin, if we're at the end, stop reading
    bits = sys.stdin.buffer.read(BUFFER_SIZE)
    if len(bits) == 0: break
    read += len(bits)
    hash.update(bits)
    if chunks is None:
        # Once we've read past the threshold, S3 would use a multipart
        # upload, so switch to hashing each chunk separately
        if read >= MULTIPART_THRESHOLD:
            chunks = b''
    if chunks is not None:
        if (read % MULTIPART_CHUNKSIZE) == 0:
            # Done with a chunk, add its digest to the list of hashes to hash later
            chunks += hash.digest()
            hash = md5()

if chunks is None:
    # Normal upload, just output the MD5 hash
    etag = hash.hexdigest()
else:
    # Multipart upload, need to output the hash of the hashes
    if (read % MULTIPART_CHUNKSIZE) != 0:
        # Add the last part if we have a partial chunk
        chunks += hash.digest()
    etag = md5(chunks).hexdigest() + "-" + str(len(chunks) // 16)

# Just show the etag, adding quotes to mimic how S3 operates
print('"' + etag + '"')

It is a drop-in replacement for your md5sum call:

$ zip -r - "$input_path" | tee >(python calculate_etag_from_pipe - >> "$MD5_FILE") >(aws s3 cp - s3://$bucket_name/$final_path_zip) >/dev/null
[ ... zip file is created and uploaded to S3 ... ]

$ cat "$MD5_FILE"
"ef5c64605cb198b65b2451a76719b8d8-96"

$ aws s3api head-object --bucket $bucket_name --key $final_path_zip --query ETag --output text
"ef5c64605cb198b65b2451a76719b8d8-96"

Note that the script as shown makes some assumptions about how the upload will be split into a multipart upload. These assumptions roughly match how the AWS CLI operates by default, but that behavior is not guaranteed. If you're using a different SDK, or different settings for the CLI, you will need to adjust MULTIPART_THRESHOLD and MULTIPART_CHUNKSIZE accordingly; the sketch below shows one way to check the SDK-side defaults.
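
As a starting point, you can inspect boto3's built-in transfer defaults on your machine. Note this only reflects the SDK defaults, not any s3 settings you may have overridden in ~/.aws/config, so treat it as a rough guide:

from boto3.s3.transfer import TransferConfig

cfg = TransferConfig()
print(cfg.multipart_threshold)   # 8388608 (8 MB) out of the box
print(cfg.multipart_chunksize)   # 8388608 (8 MB) out of the box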

Upvotes: 3
