pushkargr

Reputation: 23

gsutil unable to validate hashes for file uploaded by Kafka Connect S3

I'm trying to transfer some files from a Kafka Connect S3 sink to a Google Cloud Storage bucket using gsutil. Because the Kafka Connect S3 sink uses multipart upload, the ETag of the uploaded files (even though they are small) is not an MD5 hash, which causes gsutil to throw an integrity-check warning. Is there a way to handle the integrity check in this scenario, or should I just ignore the warnings?

I've tried both the cp and rsync commands, and they behave the same way.

gsutil -m cp -r s3://somebucket/folder gs://somebucket/folder
gsutil -m rsync -r s3://somebucket/folder gs://somebucket/folder

Non-MD5 etag ("7dc7e8a64434da88964f3d65f1e05c6b-1") present for key , data integrity checks are not possible.

WARNING: Found no hashes to validate object downloaded from s3://source-bucket/source-folder-avro/2019/07/04/22/source-file-avro+0+0000038153.avro and uploaded to gs://target_bucket/2019/07/04/22/target-file-avro+0+0000038153.avro. Integrity cannot be assured without hashes.
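
For reference, here is a quick sketch (using boto3, with the placeholder bucket/prefix names from the commands above and credentials assumed to be configured) that lists which source objects carry the multipart-style "hash-partcount" ETags gsutil is complaining about:

import boto3

s3 = boto3.client("s3")

# List objects under the source prefix and flag non-MD5 (multipart) ETags.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="somebucket", Prefix="folder/"):
    for obj in page.get("Contents", []):
        etag = obj["ETag"].strip('"')
        if "-" in etag:  # multipart uploads get "<md5-of-part-md5s>-<N>" ETags
            print(f"{obj['Key']}: non-MD5 ETag {etag}")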

Upvotes: 2

Views: 1710

Answers (1)

Travis Hobrla

Reputation: 5509

S3 multipart uploads don't have a documented way to calculate the hash; I believe the best you can do is this reverse-engineered answer, which requires you to know the part sizes of the original upload. You might be able to glean those from your Kafka Connect S3 sink configuration and follow that process to validate integrity.
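
For illustration, a rough sketch of that reverse-engineered calculation: MD5 each part, MD5 the concatenation of those raw digests, and append "-<part count>". The 8 MiB default below is only a guess and must match the part size the Kafka Connect S3 sink actually used; the file path is a hypothetical local copy of the object:

import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    # MD5 each part_size chunk, then MD5 the concatenated raw digests
    # and append the part count, mirroring S3's multipart ETag format.
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"

# Compare the result against the ETag S3 reports, e.g. the
# "7dc7e8a64434da88964f3d65f1e05c6b-1" value from the warning above.
print(multipart_etag("source-file-avro+0+0000038153.avro"))

If the computed value matches the S3 ETag, the local copy (or the GCS copy you downloaded) matches what S3 holds.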

Unless S3 changes this behavior, if you don't know the original part sizes you will never have a way to validate the integrity of a multipart-uploaded S3 object (via gsutil or any other application). So in this scenario, I think the best you can do is accept the risk and perform whatever other validation is possible based on what you know about the data type.
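
As one concrete example of that fallback, you could at least compare object sizes (or, for Avro, confirm the copied files still deserialize). A minimal sketch, with hypothetical bucket and key names and default credentials assumed for both clouds:

import boto3                      # pip install boto3
from google.cloud import storage  # pip install google-cloud-storage

s3 = boto3.client("s3")
gcs = storage.Client()

key = "2019/07/04/22/some-file.avro"  # hypothetical object key
s3_size = s3.head_object(Bucket="source-bucket", Key=key)["ContentLength"]
gcs_blob = gcs.bucket("target_bucket").get_blob(key)

# A size match is a weak check, but better than nothing when no usable
# hash exists on the S3 side.
print("size match:", gcs_blob is not None and gcs_blob.size == s3_size)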

Upvotes: 1
