harshit_sharan

Reputation: 141

Hourly data flow from SQS to S3

I have a use case which has to follow the following steps:

  1. Read messages from an AWS SQS queue
  2. Process received data and enhance it with some data obtained from other pull based sources
  3. Make the enhanced data available in AWS S3, in prefixes at an hourly cadence

Basically the major ask is how and where to buffer the data for an hour, so that it is written to S3 only once per hour rather than as soon as each message is received from SQS. The buffering cannot be done in-memory, as the number of messages received will be very large.

P.S. AWS Firehose is not an option since it doesn't ensure complete de-duplication of data written to S3, i.e. if a client-side failure occurs while sending a write request to S3, the same data may be written again. We want completely non-duplicate data in S3.

Let me know of a solution to this problem, and whether there is a pre-existing tech stack and/or system that accomplishes this.

Thanks!

Upvotes: 4

Views: 5956

Answers (1)

JaredHatfield

Reputation: 6671

I recently worked on implementing an AWS Lambda function that is scheduled to run periodically using CloudWatch Events; it consumes messages from an SQS queue and sends them to Kinesis Firehose so they can be stored in S3.

I would still recommend using AWS Firehose for this use case. AWS solves lots of very complex scalability and availability problems and masks them behind a deceptively simple API.

To address your point on de-duplication, it is important to understand that You Cannot Have Exactly-Once Delivery. You can have at-least-once delivery, and you can have at-most-once delivery, but it is impossible to have exactly-once delivery. You can attempt to implement this yourself, but it will be wrong (since it is not possible). For many people, myself included, it is good enough to trust AWS's implementation, as they provide very high quality services and APIs.

As for meeting your requirements, you could schedule an AWS Lambda function to run every hour that consumes the SQS messages, performs the additional processing, and sends them along to AWS Firehose. You can configure AWS Firehose with the maximum buffering interval and size hints so that it creates the minimum number of files. This has the effect of delaying data by roughly 1 hour and 15 minutes (the hourly schedule plus Firehose's maximum 15-minute buffer), but it will create the file in S3 roughly every hour based on the interval of the CloudWatch scheduled event.
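For illustration, here is a minimal sketch of creating such a delivery stream with boto3, with the buffering hints turned up to their maximums. The stream name, bucket ARN, and role ARN are placeholders you would substitute with your own resources:

    import boto3

    firehose = boto3.client("firehose")

    # Placeholder names/ARNs -- substitute your own resources.
    firehose.create_delivery_stream(
        DeliveryStreamName="hourly-enriched-events",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-enriched-data-bucket",
            # With no custom prefix, Firehose writes objects under an
            # hourly UTC prefix (YYYY/MM/DD/HH/) by default, which matches
            # the hourly-prefix requirement.
            "BufferingHints": {
                "SizeInMBs": 128,         # maximum allowed size hint
                "IntervalInSeconds": 900  # maximum allowed interval (15 min)
            },
        },
    )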

This is not a pre-existing technology, but the code required to implement the AWS Lambda function is very simple. You simple read messages from SQS, do your additional enhancement to the records, write them to AWS Firehose, and finally delete the messages from SQS.

Upvotes: 4
