pushkargr

Reputation: 23

gsutil unable to validate hashes for file uploaded by Kafka Connect S3

I'm trying to transfer some files from a Kafka Connect S3 sink to a Google Cloud Storage bucket using gsutil. Because the Kafka Connect S3 sink uses multipart upload, the ETag of the uploaded files (even though they are small) is not an MD5 hash, which causes gsutil to throw an integrity-check warning. Is there a way to handle the integrity check in this scenario, or should I just ignore the warnings?

I've tried both the cp and rsync commands, and they behave the same way.

gsutil -m cp -r s3://somebucket/folder gs://somebucket/folder
gsutil -m rsync -r s3://somebucket/folder gs://somebucket/folder

Non-MD5 etag ("7dc7e8a64434da88964f3d65f1e05c6b-1") present for key , data integrity checks are not possible.

WARNING: Found no hashes to validate object downloaded from s3://source-bucket/source-folder-avro/2019/07/04/22/source-file-avro+0+0000038153.avro and uploaded to gs://target_bucket/2019/07/04/22/target-file-avro+0+0000038153.avro. Integrity cannot be assured without hashes.
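
For reference, here is a quick sketch (using boto3, with the placeholder bucket/prefix names from the commands above and credentials assumed to be configured) that lists which source objects carry the multipart-style "hash-partcount" ETags gsutil is complaining about:

import boto3

s3 = boto3.client("s3")

# List objects under the source prefix and flag non-MD5 (multipart) ETags.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="somebucket", Prefix="folder/"):
    for obj in page.get("Contents", []):
        etag = obj["ETag"].strip('"')
        if "-" in etag:  # multipart uploads get "<md5-of-part-md5s>-<N>" ETags
            print(f"{obj['Key']}: non-MD5 ETag {etag}")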

Upvotes: 2

Views: 1710

Answers (1)

Travis Hobrla

Reputation: 5509

S3 multipart uploads don't have a documented way to calculate the hash; I believe the best you can do is this reverse-engineered answer, which requires you to know the part sizes of the original upload. You might be able to glean those from your Kafka Connect S3 sink configuration and follow that process to validate integrity.
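
For illustration, a rough sketch of that reverse-engineered calculation: MD5 each part, MD5 the concatenation of those raw digests, and append "-<part count>". The 8 MiB default below is only a guess and must match the part size the Kafka Connect S3 sink actually used; the file path is a hypothetical local copy of the object:

import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    # MD5 each part_size chunk, then MD5 the concatenated raw digests
    # and append the part count, mirroring S3's multipart ETag format.
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"

# Compare the result against the ETag S3 reports, e.g. the
# "7dc7e8a64434da88964f3d65f1e05c6b-1" value from the warning above.
print(multipart_etag("source-file-avro+0+0000038153.avro"))

If the computed value matches the S3 ETag, the local copy (or the GCS copy you downloaded) matches what S3 holds.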

Unless S3 changes this behavior, if you don't know the original part sizes you will never have a way to validate the integrity of a multipart-uploaded S3 object (via gsutil or any other application). So in this scenario, I think the best you can do is accept the risk and perform whatever other validation is possible based on what you know about the data type.
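
As one concrete example of that fallback, you could at least compare object sizes (or, for Avro, confirm the copied files still deserialize). A minimal sketch, with hypothetical bucket and key names and default credentials assumed for both clouds:

import boto3                      # pip install boto3
from google.cloud import storage  # pip install google-cloud-storage

s3 = boto3.client("s3")
gcs = storage.Client()

key = "2019/07/04/22/some-file.avro"  # hypothetical object key
s3_size = s3.head_object(Bucket="source-bucket", Key=key)["ContentLength"]
gcs_blob = gcs.bucket("target_bucket").get_blob(key)

# A size match is a weak check, but better than nothing when no usable
# hash exists on the S3 side.
print("size match:", gcs_blob is not None and gcs_blob.size == s3_size)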

Upvotes: 1
