Reputation: 11891
I need to store millions of small JSON objects (around 2,500 bytes each) in AWS S3 and I need to be able to retrieve them three different ways:
The object keys will be organized by Timestamp, so retrieving an object by Timestamp range will be very quick. Also, objects which share the same Timestamp (e.g. same minute) may be concatenated into a single S3 object containing one JSON object per line. Combining improves write performance and also works nicely with EMR and Athena.
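To illustrate the layout I have in mind, here is a rough sketch (names and key scheme are just examples) of grouping records by minute and concatenating each group into a JSON-Lines payload under a time-sortable S3 key:

```python
import json
from datetime import datetime

# Hypothetical sketch: group small JSON records by minute-resolution
# timestamp and concatenate each group into one JSON-Lines payload,
# keyed so the resulting S3 keys sort chronologically.
def build_minute_batches(records):
    """records: iterable of dicts with 'id' and an ISO-8601 'timestamp'."""
    batches = {}
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        key = ts.strftime("data/%Y/%m/%d/%H%M.jsonl")  # time-sortable key
        batches.setdefault(key, []).append(json.dumps(rec))
    # one JSON object per line, as described above
    return {key: "\n".join(lines) + "\n" for key, lines in batches.items()}

records = [
    {"id": "a1", "timestamp": "2017-05-04T10:15:02", "v": 1},
    {"id": "b2", "timestamp": "2017-05-04T10:15:59", "v": 2},
    {"id": "c3", "timestamp": "2017-05-04T10:16:10", "v": 3},
]
batches = build_minute_batches(records)
```

Each value in `batches` would become the body of one S3 object, which keeps timestamp-range retrieval a simple key-prefix listing.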
However, retrieving by ID will be impossibly slow. I need a way to retrieve large sets of IDs relatively quickly, e.g. retrieve the timestamps of 100,000 objects (given a list of 100,000 IDs) without scanning the payloads of the entire dataset.
Which AWS service would provide the best way to index the contents of S3 in this scenario?
Upvotes: 2
Views: 7538
Reputation: 179274
The question is certainly on the fringe of opinion-based. I won't venture to claim this is the best solution, but it is a viable solution within the bounds of the "which AWS service" aspect of the question: RDS for MariaDB is what I use for this exact purpose, with S3 > SNS > Lambda events maintaining the index on RDS, including looking up each object's metadata from S3 and storing that, properly normalized and indexed, as well.
The reason S3 > SNS > Lambda instead of just S3 > Lambda is that I have the SNS topic fanning out to both Lambda and an SQS queue, which is read by a "second look" audit process that verifies that everything has been captured correctly.
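The Lambda side of that pipeline is mostly plumbing: unwrap the S3 event from the SNS envelope and turn each object notification into an index row. A hedged sketch of just that step (the table and column names here are assumptions, and the real function would execute the INSERT with a MySQL client rather than return tuples):

```python
import json
from urllib.parse import unquote_plus

# Illustrative index schema; real code would run this via a MySQL client.
INSERT_SQL = (
    "INSERT INTO s3_index (bucket, object_key, size, etag) "
    "VALUES (%s, %s, %s, %s)"
)

def index_rows_from_sns_event(event):
    """Unpack an SNS-wrapped S3 event into INSERT parameter tuples."""
    rows = []
    for sns_record in event["Records"]:
        s3_event = json.loads(sns_record["Sns"]["Message"])
        for rec in s3_event["Records"]:
            obj = rec["s3"]["object"]
            rows.append((
                rec["s3"]["bucket"]["name"],
                unquote_plus(obj["key"]),  # S3 event keys are URL-encoded
                obj["size"],
                obj["eTag"],
            ))
    return rows

# Example SNS event wrapping a single S3 put notification:
event = {"Records": [{"Sns": {"Message": json.dumps({"Records": [{
    "s3": {
        "bucket": {"name": "my-data-bucket"},
        "object": {"key": "data/2017/05/04/1015.jsonl",
                   "size": 5000, "eTag": "abc123"},
    }
}]})}}]}
rows = index_rows_from_sns_event(event)
```

The "second look" audit process reads the same notifications from SQS and can re-run the same extraction to verify every row made it into the table.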
This is still in limited production use here, so most of my buckets aren't configured yet... but as of today I have 11,803,039 objects indexed on a t2.micro RDS instance and it's not having any trouble so far, so it's pretty respectable and not expensive.
Upvotes: 4
Reputation: 11891
In the 10 months since posting this question, I experimented with DynamoDB, and struggled for a while with a MySQL-based solution that even went into production but had stability issues. Finally I had some time to refactor, and arrived at a solution I had not initially considered: store the indexes as gzip'd JSON files in S3 itself, and cache them in the client that needs the index for querying. There are obviously requirements around data latency etc. that need to be considered, but generally speaking I found this approach to be the simplest, with reasonable performance across the use cases described in the original question.
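The core of the approach fits in a few lines. A minimal sketch, assuming the index maps ID to timestamp (the shape and names are illustrative; real code would upload and download the blob via boto3):

```python
import gzip
import json

# The index itself is a gzip'd JSON file that lives in S3 alongside
# the data; clients fetch it once and answer ID lookups from the
# cached in-memory copy.
def serialize_index(index):
    """Index dict -> gzip'd bytes, ready to PUT to S3."""
    return gzip.compress(json.dumps(index).encode("utf-8"))

def load_index(blob):
    """gzip'd bytes from S3 -> index dict cached by the client."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

index = {"id-001": "2017-05-04T10:15", "id-002": "2017-05-04T10:16"}
blob = serialize_index(index)   # what would be uploaded to S3
cached = load_index(blob)       # what the client keeps in memory

# Resolving a batch of IDs to timestamps is then a local dict lookup:
timestamps = [cached[i] for i in ["id-002", "id-001"]]
```

Resolving even 100,000 IDs against the cached dict is in-memory work; the only S3 round trips are the occasional index refreshes, which is where the data-latency trade-off comes in.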
Upvotes: 6
Reputation: 200889
Any database will work for this. Amazon DynamoDB would work quite well, since you wouldn't have to manage servers. You could have S3 send an event notification to an AWS Lambda function whenever a new file is added to the bucket. The Lambda function could then parse the file for the information that needs to be stored and indexed, and insert it into a DynamoDB table. From there you could query the table, either by ID or by timestamp range, whenever you need to retrieve files from S3.
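To make the Lambda step concrete, here is a rough sketch of parsing a newly added JSON-Lines object into items keyed by ID, with the timestamp and source key kept for lookups. A plain dict stands in for the table; real code would write the items with boto3's DynamoDB `batch_writer`, and the field names here are assumptions:

```python
import json

# Turn one S3 object's JSON-Lines body into index items, one per record.
def items_from_jsonl(s3_key, body):
    items = []
    for line in body.splitlines():
        rec = json.loads(line)
        items.append({
            "id": rec["id"],            # DynamoDB partition key
            "timestamp": rec["timestamp"],
            "s3_key": s3_key,           # where to fetch the full payload
        })
    return items

table = {}  # stand-in for a DynamoDB table with 'id' as partition key
body = (
    '{"id": "a1", "timestamp": "2017-05-04T10:15:02"}\n'
    '{"id": "b2", "timestamp": "2017-05-04T10:15:59"}'
)
for item in items_from_jsonl("data/2017/05/04/1015.jsonl", body):
    table[item["id"]] = item
```

A batch ID lookup is then a `batch_get_item` against the table, and each hit tells you which S3 object to read.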
Upvotes: 0