Reputation: 23
I'm trying to transfer some files from a Kafka S3 sink to a Google Cloud Storage bucket using gsutil. Because Kafka Connect uploads to S3 with multi-part uploads, the ETag of the uploaded files (even though they are small) is not an MD5 hash, which causes gsutil to throw integrity-check warnings. I want to know if there is a way to handle the integrity check in such a scenario, or whether I should just ignore the warnings.
I've tried both the cp and rsync commands, and they show the same behavior.
gsutil -m cp -r s3://somebucket/folder gs://somebucket/folder
gsutil -m rsync -r s3://somebucket/folder gs://somebucket/folder
Non-MD5 etag ("7dc7e8a64434da88964f3d65f1e05c6b-1") present for key , data integrity checks are not possible.
WARNING: Found no hashes to validate object downloaded from s3://source-bucket/source-folder-avro/2019/07/04/22/source-file-avro+0+0000038153.avro and uploaded to gs://target_bucket/2019/07/04/22/target-file-avro+0+0000038153.avro. Integrity cannot be assured without hashes.
Upvotes: 2
Views: 1710
Reputation: 5509
S3 multi-part uploads don't have a documented way to calculate the hash; I believe the best you can do is this reverse-engineered approach, which requires you to know the part size used for the original upload. You might be able to glean that from your Kafka-S3 sink configuration and follow that process to validate integrity.
Unless S3 changes this behavior, if you don't know the original part sizes, you will never have a way to validate the integrity of a multipart-uploaded S3 object (via gsutil or any other application). So in this scenario, I think the best you can do is accept the risk and perform whatever other validation is possible based on what you know about the data type.
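For reference, here is a minimal Python sketch of that reverse-engineered calculation, assuming you can fetch the object locally and that you know the part size the connector used (for the Confluent S3 sink connector this is usually governed by its s3.part.size setting, but verify against your own configuration). The file name and part size below are placeholders, not values from your setup:

import hashlib

def multipart_etag(path, part_size):
    # Recompute the S3 multipart ETag: MD5 each part, MD5 the
    # concatenation of the binary part digests, then append
    # "-<number of parts>".
    part_md5s = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_md5s.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(part_md5s)).hexdigest()
    return "{}-{}".format(combined, len(part_md5s))

# Hypothetical usage: compare the result against the ETag S3 reports
# (e.g. "7dc7e8a64434da88964f3d65f1e05c6b-1" from the warning above).
print(multipart_etag("source-file-avro+0+0000038153.avro", 5 * 1024 * 1024))

If the computed value matches the S3 ETag, you've effectively verified the S3 side; you would still need to compare against the GCS object's MD5 (or recompute the file's plain MD5) to close the loop end to end.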
Upvotes: 1