Reputation: 996
With a large number of parts in a file, it is easy to find the correct part size, since only a limited number of plausible part sizes exist and they can often be assumed to fall on MiB or MB boundaries.
However, as the number of parts diminishes, many different part sizes become possible for a given upload, so it is difficult for an algorithm to guess the right one and time-consuming to confirm each candidate by recomputing the ETag.
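For illustration, here is a minimal sketch of that enumeration (candidate_part_sizes is a hypothetical helper, not an existing API): given the object size and the part count from the ETag's -N suffix, it lists the MiB-aligned part sizes that would produce that many parts. With few parts, many candidates survive, and each one must then be confirmed by rehashing the whole object.

def candidate_part_sizes(object_size: int, parts: int):
    """Yield MiB-aligned part sizes that split `object_size` bytes
    into exactly `parts` parts (a sketch; MB boundaries could be
    enumerated the same way)."""
    mib = 1024 * 1024
    for size in range(mib, object_size + mib, mib):
        # Integer ceiling division: number of parts for this size.
        if (object_size + size - 1) // size == parts:
            yield size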
Knowing the part sizes is useful because S3's ETag algorithm only generates the same value for two identical objects when the payloads match and both objects were uploaded with the same part sizes; otherwise, two identical objects end up with different ETags. (HTTP itself does not require two identical objects to have the same ETag, but matching ETags are useful for integrity validation.)
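For concreteness, here is a minimal Python sketch of how the multipart ETag is commonly computed (the helper name multipart_etag is mine): MD5 each part, concatenate the raw digests, MD5 the result, and append -N where N is the part count.

import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute an S3-style multipart ETag for `data` uploaded in
    parts of `part_size` bytes."""
    digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    if len(digests) == 1:
        # Single-part uploads get a plain MD5 hex digest as the ETag.
        return hashlib.md5(data).hexdigest()
    return hashlib.md5(b"".join(digests)).hexdigest() + f"-{len(digests)}"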
Is it possible to learn the part sizes that were used for uploading an existing object in S3?
Upvotes: 4
Views: 3022
Reputation: 996
I will answer my own question, as this has been bugging me for a while and I just found a way to resolve it. For copying the contents of buckets, most if not all solutions I have seen resort to guessing the part size and simply abandon the idea of having matching ETags on source and target buckets. Funnily enough, AWS themselves have published the Campanile framework, which resorts to guessing the part size and simply assumes the object was uploaded with the AWS CLI tools.
It turns out there is a documented way of doing this: the AWS CLI's get-object and head-object APIs accept a --part-number option, which lets you specify the part you want, like this:
aws s3api head-object --bucket YOURBUCKET --key YOURKEY --part-number 1
This returns a response that looks like this:
{
    "AcceptRanges": "bytes",
    "ContentType": "application/octet-stream",
    "LastModified": "Mon, 31 Jul 2017 08:23:11 GMT",
    "ContentLength": 8388608,
    "ETag": "\"XXXX-6\"",
    "ServerSideEncryption": "AES256",
    "PartsCount": 6,
    "Metadata": {}
}
In this case, as you can see, the ContentLength of part number 1 tells us the part size used for this upload: 8388608 bytes, i.e. 8 MiB, the same size that was used when uploading this object. The ETag suffix -6 and the PartsCount field also confirm the object was uploaded in 6 parts.
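The same call can be made via boto3 (bucket and key names are placeholders), since head_object accepts a PartNumber parameter:

import boto3

s3 = boto3.client("s3")
resp = s3.head_object(Bucket="YOURBUCKET", Key="YOURKEY", PartNumber=1)
part_size = resp["ContentLength"]   # size of part 1, e.g. 8388608
parts_count = resp["PartsCount"]    # total number of parts, e.g. 6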
If you use the --debug flag, you can see how this is done at the REST level: the CLI simply adds a partNumber=1 query parameter to the URL:
aws --debug s3api head-object --bucket YOURBUCKET --key YOURKEY --part-number 1
....
2017-07-31 16:21:46,968 - MainThread - botocore.endpoint - DEBUG - Making request for OperationModel(name=HeadObject) (verify_ssl=True) with params:
{'body': '', 'url': u'https://s3.amazonaws.com/YOURKEY/?partNumber=1',
'headers': {'User-Agent': 'aws-cli/1.11.127 Python/2.7.12 Linux/4.4.35-33.55.amzn1.x86_64 botocore/1.5.90'},
'context': {'auth_type': None, 'client_region': 'us-east-1', 'signing': {'bucket': u'YOURBUCKET'}, 'has_streaming_input': False, 'client_config': <botocore.config.Config object at 0x7f20a8e1ff50>},
-----> 'query_string': {u'partNumber': 1}, <-----
'url_path': u'/YOURBUCKET/YOURKEY', 'method': u'HEAD'}
....
The next bit is figuring out how to sign such URLs; the AWS CLI command aws s3 presign is unable to do that.
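One would expect boto3's generate_presigned_url to handle it, though, since it signs whatever Params the underlying operation accepts; a sketch under that assumption (not verified against every SDK version):

import boto3

s3 = boto3.client("s3")
# Presign a HEAD request for part 1; the PartNumber parameter should
# end up as a signed partNumber=1 query-string entry.
url = s3.generate_presigned_url(
    "head_object",
    Params={"Bucket": "YOURBUCKET", "Key": "YOURKEY", "PartNumber": 1},
    ExpiresIn=3600,
)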
Upvotes: 11