Reputation: 416
I'm a developer fairly new to AWS and S3 who has been tasked with iterating over every object in a large, active S3 bucket and performing processing on the individual object summaries, but not the objects themselves.
The S3 bucket continually has new objects written to it and existing objects read from it, but no objects are being deleted or updated. We can assume that any new objects arriving in S3 have already been processed, so we only need to worry about the historical data. Re-processing already-processed data is not ideal, but acceptable if there is no effective way around it. New objects are arriving at an unknown rate that could be higher or lower than the rate at which a single thread could make a ListObjectsV2Request, process the object summaries, and retrieve the next page of results.
The number of objects within the S3 bucket is ~100 000 000 (100 million), far too many for a single ListObjectsV2Request (which returns at most 1,000 keys per page), and the full set of object summaries would not fit in memory even if it could be retrieved in one request.
Is there any way to take a "snapshot" of the current objects within the bucket and perform my processing on that? Failing that, does the pagination supported by the AWS v2 SDK operate on a "snapshot" taken when the first request was made, or will it continually feed in new objects as they are written to the bucket after the first request?
I.e., is it possible to perform
ListObjectsV2Request request = new ListObjectsV2Request()
        .withBucketName(myBucket)
        .withMaxKeys(MAX_KEYS);
ListObjectsV2Result result;
do {
    result = s3Client.listObjectsV2(request);
    processObjectSummaries(result.getObjectSummaries());
    // Feed the continuation token back in to fetch the next page.
    request.setContinuationToken(result.getNextContinuationToken());
} while (result.isTruncated());
and process my historical data with minimal re-processing of new data?
Upvotes: 0
Views: 1453
Reputation: 13187
Create an inventory of all the objects in your S3 bucket using S3 Inventory.
This gives you a CSV, ORC, or Parquet file (delivered on a daily or weekly schedule) that contains a list of all the objects currently in your bucket. You can then use Python, Athena, or whatever tool you choose to work with that data and do aggregations.
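For reference, here is a minimal sketch of enabling such an inventory with the v1 Java SDK (to match the question's code). The bucket names and configuration id are placeholders, and the destination bucket must have a bucket policy that allows S3 to deliver reports to it:

import java.util.Arrays;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.inventory.InventoryConfiguration;
import com.amazonaws.services.s3.model.inventory.InventoryDestination;
import com.amazonaws.services.s3.model.inventory.InventoryS3BucketDestination;
import com.amazonaws.services.s3.model.inventory.InventorySchedule;

AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();

// Where S3 should deliver the inventory files (placeholder bucket ARN).
InventoryS3BucketDestination s3Destination = new InventoryS3BucketDestination();
s3Destination.setBucketArn("arn:aws:s3:::my-inventory-destination-bucket");
s3Destination.setFormat("CSV"); // or "ORC" / "Parquet"
s3Destination.setPrefix("inventory");

InventoryDestination destination = new InventoryDestination();
destination.setS3BucketDestination(s3Destination);

InventorySchedule schedule = new InventorySchedule();
schedule.setFrequency("Daily"); // or "Weekly"

InventoryConfiguration configuration = new InventoryConfiguration();
configuration.setId("historical-snapshot"); // placeholder id
configuration.setEnabled(true);
configuration.setIncludedObjectVersions("Current");
configuration.setDestination(destination);
configuration.setSchedule(schedule);
// Include LastModifiedDate so the cutoff-date filtering below is possible.
configuration.setOptionalFields(Arrays.asList("Size", "LastModifiedDate"));

s3Client.setBucketInventoryConfiguration("myBucket", configuration);

Note that the first report can take up to 48 hours to be delivered after the configuration is created, so this is a batch-oriented listing rather than a real-time one.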
The inventory is eventually consistent, but it contains a LastModified timestamp for each object, so you can use a cutoff date to distinguish the historical data from any new data you process through an SQS/Lambda integration.
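As an illustration, here is a sketch that streams one gzipped CSV inventory data file and keeps only the rows older than a cutoff. The class name, helper method, cutoff date, and column order are assumptions; check the fileSchema in the manifest.json that S3 delivers alongside the data files for the actual layout:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.net.URLDecoder;
import java.time.Instant;
import java.util.zip.GZIPInputStream;

public class InventoryFilter {
    public static void main(String[] args) throws Exception {
        // Hypothetical cutoff separating historical objects from new ones
        // already handled by the SQS/Lambda integration.
        Instant cutoff = Instant.parse("2021-01-01T00:00:00Z");

        // args[0] is the path to one downloaded inventory data file (.csv.gz).
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0]))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed layout: "bucket","key","size","last-modified".
                // Naive split; use a real CSV parser if keys may contain commas.
                String[] cols = line.split(",");
                // Keys are URL-encoded in CSV inventory reports.
                String key = URLDecoder.decode(cols[1].replace("\"", ""), "UTF-8");
                Instant lastModified = Instant.parse(cols[3].replace("\"", ""));
                if (lastModified.isBefore(cutoff)) {
                    processHistoricalKey(key); // your per-object-summary processing
                }
            }
        }
    }

    private static void processHistoricalKey(String key) {
        System.out.println(key);
    }
}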
If you need to re-process parts of the data, also be aware of S3 Batch Operations, which can run an operation (for example, invoking a Lambda function) against every object listed in an inventory report or CSV manifest.
Upvotes: 2