Reputation: 8834
I have a use case where I upload hundreds of files to my S3 bucket using multipart upload. After each upload I need to make sure that the uploaded file is not corrupt (basically check for data integrity). Currently, after uploading the file, I re-download it, compute the md5 of the content string, and compare it with the md5 of the local file. So something like:
import math
import os

from boto.s3.connection import S3Connection
from filechunkio import FileChunkIO

conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')
source_path = 'file_to_upload'
source_size = os.stat(source_path).st_size

mp = bucket.initiate_multipart_upload(os.path.basename(source_path))
chunk_size = 52428800  # 50 MB parts
chunk_count = int(math.ceil(source_size / float(chunk_size)))

for i in range(chunk_count):
    offset = chunk_size * i
    part_size = min(chunk_size, source_size - offset)
    with FileChunkIO(source_path, 'r', offset=offset, bytes=part_size) as fp:
        # boto computes and sends the part's MD5 itself when md5 is not supplied
        mp.upload_part_from_file(fp, part_num=i + 1)
mp.complete_upload()
obj_key = bucket.get_key(os.path.basename(source_path))
print(obj_key.md5)        # prints None
print(obj_key.base64md5)  # prints None
content = obj_key.get_contents_as_string()
# compute the md5 on content
This approach is wasteful as it doubles the bandwidth usage. I tried bucket.get_key('file_name').md5 and bucket.get_key('file_name').base64md5, but both return None.
Is there any other way to get the md5 without downloading the whole thing?
Upvotes: 18
Views: 23594
Reputation: 22332
You can recover the md5 without downloading the file, from the e_tag attribute, like this:
boto3.resource('s3').Object(<BUCKET_NAME>, file_path).e_tag[1:-1]
Then use this function to compare it for plain (single-part) s3 files:
import hashlib

def md5_checksum(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()
Or this function for multipart files:
def etag_checksum(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    # A multipart ETag is the md5 of the concatenated part digests,
    # followed by a dash and the number of parts
    m = hashlib.md5(b"".join(md5s))
    return '{}-{}'.format(m.hexdigest(), len(md5s))
Finally use this function to choose between the two:
def md5_compare(file_path, s3_file_md5):
    if '-' in s3_file_md5 and s3_file_md5 == etag_checksum(file_path):
        return True
    if '-' not in s3_file_md5 and s3_file_md5 == md5_checksum(file_path):
        return True
    print("MD5 does not match for file " + file_path)
    return False
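For example, a minimal usage sketch (the bucket name, key, and local path here are hypothetical placeholders, and boto3 is assumed to find credentials in the usual way):
import boto3

# Fetch the object's ETag with the surrounding quotes stripped
etag = boto3.resource('s3').Object('my-bucket', 'my_file.txt').e_tag[1:-1]
if md5_compare('my_file.txt', etag):
    print("Integrity check passed")
Note that the multipart comparison only works when chunk_size matches the part size actually used during the upload (8 MB is the AWS CLI default).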
Credit to: https://zihao.me/post/calculating-etag-for-aws-s3-objects/
Upvotes: 3
Reputation: 2886
Since 2016, the best way to do this without any additional object retrievals is by presenting the --content-md5 argument during a PutObject request. AWS will then verify that the provided MD5 matches their calculated MD5. This also works for multipart uploads and objects larger than 5 GB.
An example call from the knowledge center:
aws s3api put-object --bucket awsexamplebucket --key awsexampleobject.txt --body awsexampleobjectpath --content-md5 examplemd5value1234567== --metadata md5checksum=examplemd5value1234567==
https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/
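If you are using boto3 rather than the CLI, here is a minimal sketch of the same idea (the bucket name, key, and file path are placeholders); put_object accepts a ContentMD5 parameter, and S3 rejects the upload if the base64-encoded digest does not match what it computes:
import base64
import hashlib

import boto3

with open('awsexampleobjectpath', 'rb') as f:
    body = f.read()

# S3 expects the MD5 as a base64-encoded digest
content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode('utf-8')

boto3.client('s3').put_object(
    Bucket='awsexamplebucket',
    Key='awsexampleobject.txt',
    Body=body,
    ContentMD5=content_md5,
)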
Upvotes: 1
Reputation: 1691
With boto3, I use head_object to retrieve the ETag.
import boto3
import botocore

def s3_md5sum(bucket_name, resource_name):
    try:
        # A HEAD request returns the metadata (including the ETag)
        # without transferring the object body
        md5sum = boto3.client('s3').head_object(
            Bucket=bucket_name,
            Key=resource_name
        )['ETag'][1:-1]
    except botocore.exceptions.ClientError:
        md5sum = None
    return md5sum
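A quick usage sketch (bucket and file names are placeholders; as noted above, a dashed ETag from a multipart upload will not equal a plain MD5):
import hashlib

with open('local_file.txt', 'rb') as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()

if s3_md5sum('my-bucket', 'remote_file.txt') == local_md5:
    print("Checksums match")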
Upvotes: 11
Reputation: 1161
yes
use bucket.get_key('file_name').etag[1 :-1]
this way get key's MD5 without downloading it's contents.
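For example, a minimal sketch with boto 2 (assuming hashlib is imported and the object was uploaded in a single part, since a multipart ETag is not a plain MD5):
key_md5 = bucket.get_key('file_name').etag[1:-1]
with open('file_name', 'rb') as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()
print(key_md5 == local_md5)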
Upvotes: 26