Tom
Tom

Reputation: 143

Amazon S3, Syncing, Modified date vs. Uploaded Date

We're using the AWS SDK for .NET and I'm trying to pinpoint where we seem to be having a sync problem with our consumer applications. Basically we have a push-service that generates changeset files that get uploaded to S3, and our consumer applications are supposed to download these files and apply them in order to sync up to the correct state, which is not happening.

There's some conflicting views on what/where the correct datestamps are represented. Our consumers were written to look at the s3 file's "LastModified" field to sort the downloaded files for processing, and I don't know anymore what this field represents. At first I thought it represented the date modified/created of the file we uploaded, then (as seen here) it actually represents a new date stamp of when the file was uploaded, and likewise in the same link it seems to imply that when a file is downloaded it reverts back to the old datestamp (but I cannot confirm this).

We're using this snippet of code to pull files

// Get a list of the latest changesets since the last successful full update.
Amazon.S3.AmazonS3Client client = ...;

List<Amazon.S3.Model.S3Object> listObjects = client.GetFullObjectList(
    this.Settings.GetS3ListObjectsRequest(this.Settings.S3ChangesetSubBucket), 
    Amazon.S3.AmazonS3Client.DateComparisonType.GreaterThan, 
    lastModifiedDate, 
    Amazon.S3.AmazonS3Client.StringTokenComparisonType.MustContainAll, 
    this.Settings.RequiredChangesetPathTokens);

And then sort by the S3Object's LastModified (which I think is where our assumption is wrong)

foreach (Amazon.S3.Model.S3Object obj in listObjects)
{
    if (DateTime.Parse(obj.LastModified) > lastModifiedDate)
    {
        //it's a new file, so we use insertion sort to put this file in an ordered list
        //based on LastModified
    }
}

Am I correct in assuming that we should be doing something more to preserve our own datestamps that we need, such as using custom header/metadata objects to put the correct datestamps on files that we need, or even putting it in the filename itself?

EDIT

Perhaps this question can answer my problem: If my service has 2 files to upload to S3 and goes through the process of doing that, am I guaranteed that these files show up in S3 in the order they were uploaded (via LastModified) or does S3 do some amount of asynchronous processing that could lead to my files showing up in a list of S3 object out of order? I'm worried about a case where, for example, my service uploaded files A then B, B shows up first in S3, my consumers get + process B, then A shows up, and then my consumers may or may not get A and incorrectly process it thinking it's newer when it's not?

EDIT 2

It was as I and the person below suspected and we had some racing conditions trying to apply changesets in order while blindly relying on S3's datestamps. As an addendum, we ended up making 2 fixes to try and address the problem, which might be useful for others as well:

Firstly, to address to the race condition between when our uploads finish and the modified dates reported by S3, we decided to make all our queries look into the past by 1 second from the last date modified we read from a pulled file in S3. In examining this fix we saw another problem in S3 that wasn't apparent before, namely that S3 does not preserve milliseconds on timestamps, but rather rounded them up to the next second for all its timestamps. Looking back in time by 1 second circumvented this.

Secondly, since we were looking back in time we would have the problem of downloading the same file multiple times if there weren't any new changeset files to download, so we added a filename buffer for files we saw in our last request, skipped any files we had already seen, and refreshed the buffer when we saw new files.

Hope this helps.

Upvotes: 5

Views: 7564

Answers (1)

dcro
dcro

Reputation: 13679

When listing objects in an S3 bucket, the API response received from S3 will always return them in alphabetical order.

The S3 API does not allow you to filter or sort objects based on the LastModified value. Any such filtering or sorting is done exclusively in the client libraries that you use to connect to S3.

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

As for the accuracy of the LastModified value and it's possible use to sort the list of objects based on the time they were uploaded, to my knowledge, the LastModified value is set to the time the upload finishes (when the server returns a 200 OK response) and not the time the upload was started.

This means that if you start upload A that's 100MB in size and a second later you start upload B that's only 1K in size, in the end, the last modified timestamp for A will be after the last modified timestamp for B.

If you need to preserve the time your upload was started, it's best to use a custom metadata header with your original PUT request.

Upvotes: 6

Related Questions