Reputation: 35
TL;DR: How do I ensure guaranteed delivery of S3 events when successful processing depends on both Lambda and Glue?
I am totally new to AWS. I am trying to implement a data processing pipeline that uses a Glue PySpark job to process data newly added to S3.
Here is what the current data flow looks like:
S3 Event > Lambda > Glue job
The S3 event is sent to a Lambda function, which starts the Glue job and passes the bucket and object key from the event. The Glue job then processes the new data at that location.
However, I want to make the system resilient, and the above data flow has some issues.
To deal with this, I am thinking of adding an SQS queue to the flow, so the new flow would look like this:
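For reference, a minimal sketch of what such a Lambda handler might look like; the Glue job name and argument keys are assumptions for illustration, not taken from my actual setup:

```python
import boto3

glue = boto3.client("glue")

GLUE_JOB_NAME = "my-pyspark-job"  # assumed job name


def lambda_handler(event, context):
    # The handler is triggered directly by the S3 event notification.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Start the Glue job and pass the S3 location as job arguments.
        glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={
                "--source_bucket": bucket,
                "--source_key": key,
            },
        )
```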
S3 event > SQS > Lambda > Glue Job
I am considering the above flow so that:
However, where I need help is how to deal with Glue failures.
I am thinking of this scenario:
In that scenario, the S3 event should go back to SQS so the Lambda can pick it up again later.
I am thinking of adding logic to the Glue job to manually remove the message from SQS as the last step of the job: the Lambda picks up the event, but Glue deletes it once the cycle is complete.
But that could cause unexpected behavior: the Lambda could pick up the same event twice, because the message has not yet been removed while the Glue job is still running (a sketch of this idea follows below).
I need some advice on how to tackle this issue and make my system robust and unambiguous.
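To make the idea concrete, here is a rough sketch with an assumed job name, queue URL, and argument keys (none of these come from my real setup). The Lambda side forwards the SQS receipt handle to Glue:

```python
import json
import boto3

glue = boto3.client("glue")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder


def lambda_handler(event, context):
    # Triggered by the SQS queue; each record body holds the original S3 event.
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        s3_record = s3_event["Records"][0]

        glue.start_job_run(
            JobName="my-pyspark-job",  # assumed job name
            Arguments={
                "--source_bucket": s3_record["s3"]["bucket"]["name"],
                "--source_key": s3_record["s3"]["object"]["key"],
                "--queue_url": QUEUE_URL,
                "--receipt_handle": record["receiptHandle"],
            },
        )
    # Caveat 1: with the managed SQS trigger, Lambda deletes the message itself
    # when this handler returns successfully, so Glue never gets to own the
    # deletion unless the queue is polled manually or the handler is made to fail.
    # Caveat 2: if the queue's visibility timeout is shorter than the Glue run,
    # the message becomes visible again and is picked up a second time --
    # exactly the duplicate-processing behavior described above.
```

And the Glue side would delete the message as its final step:

```python
import sys
import boto3
from awsglue.utils import getResolvedOptions

# "queue_url" and "receipt_handle" are the job arguments passed in by the Lambda above.
args = getResolvedOptions(sys.argv, ["queue_url", "receipt_handle"])

# ... process the new data for the given bucket/key here ...

# Delete the message only after processing has finished successfully.
boto3.client("sqs").delete_message(
    QueueUrl=args["queue_url"],
    ReceiptHandle=args["receipt_handle"],
)
```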
Some background on why I chose this flow:
Upvotes: 0
Views: 526
Reputation: 238051
I would look at Step Functions (SF), as they have an AWS Glue task that can run synchronously.
I think you could orchestrate your work using SF with the sync Glue task. In your state machine you would have a branch based on the Success or Failure of the Glue task, which would lead to either re-sending a new message to SQS or removing it from the queue.
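A rough sketch of what such a state machine could look like, expressed as a Python dict and deployed with boto3; the job name, role ARN, and state names are placeholders, and the success/failure states are just stubs for where you would delete or re-send the SQS message:

```python
import json
import boto3

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # .sync makes the state wait until the Glue job run finishes
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "my-pyspark-job",
                "Arguments": {
                    "--source_bucket.$": "$.bucket",
                    "--source_key.$": "$.key",
                },
            },
            "Catch": [
                {"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}
            ],
            "Next": "HandleSuccess",
        },
        # Success branch: e.g. delete the message from the queue here.
        "HandleSuccess": {"Type": "Succeed"},
        # Failure branch: e.g. re-send the message to SQS (sqs:sendMessage task).
        "HandleFailure": {"Type": "Fail", "Error": "GlueJobFailed"},
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="glue-ingest-orchestrator",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsGlueRole",  # placeholder
)
```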
Upvotes: 1