Reputation: 35
TL;DR: How do I ensure guaranteed delivery of S3 events when successful processing depends on both Lambda and Glue?
I am totally new to AWS. I am trying to implement a data processing pipeline that uses a Glue PySpark job to process data newly added to S3.
Here is what the current data flow looks like:
S3 Event > Lambda > Glue job
The S3 event is sent to a Lambda function, which starts the Glue job and passes the bucket and object key from the event. The Glue job then processes the new data at that location.
However, I want to make the system resilient, and the above data flow has some issues.
To deal with this, I am thinking of adding an SQS queue to the flow, so the new flow would look like this:
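For reference, a minimal sketch of what such a Lambda handler might look like; the Glue job name and argument keys are assumptions for illustration, not taken from my actual setup:

```python
import boto3

glue = boto3.client("glue")

GLUE_JOB_NAME = "my-pyspark-job"  # assumed job name


def lambda_handler(event, context):
    # The handler is triggered directly by the S3 event notification.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Start the Glue job and pass the S3 location as job arguments.
        glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={
                "--source_bucket": bucket,
                "--source_key": key,
            },
        )
```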
S3 event > SQS > Lambda > Glue Job
I am considering the above flow so that:
However, where I need help is how to deal with Glue failures.
I am thinking of this scenario:
In that scenario, the S3 event should go back to SQS so the Lambda can pick it up again later.
I am thinking of adding logic to the Glue job to manually remove the message from SQS as the last step of the job: the Lambda picks up the event, but Glue deletes it once the cycle is complete.
But that could cause unexpected behavior: the Lambda could pick up the same event twice, because the message has not yet been removed while the Glue job is still running (a sketch of this idea follows below).
I need some advice on how to tackle this issue and make my system robust and unambiguous.
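To make the idea concrete, here is a rough sketch with an assumed job name, queue URL, and argument keys (none of these come from my real setup). The Lambda side forwards the SQS receipt handle to Glue:

```python
import json
import boto3

glue = boto3.client("glue")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder


def lambda_handler(event, context):
    # Triggered by the SQS queue; each record body holds the original S3 event.
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        s3_record = s3_event["Records"][0]

        glue.start_job_run(
            JobName="my-pyspark-job",  # assumed job name
            Arguments={
                "--source_bucket": s3_record["s3"]["bucket"]["name"],
                "--source_key": s3_record["s3"]["object"]["key"],
                "--queue_url": QUEUE_URL,
                "--receipt_handle": record["receiptHandle"],
            },
        )
    # Caveat 1: with the managed SQS trigger, Lambda deletes the message itself
    # when this handler returns successfully, so Glue never gets to own the
    # deletion unless the queue is polled manually or the handler is made to fail.
    # Caveat 2: if the queue's visibility timeout is shorter than the Glue run,
    # the message becomes visible again and is picked up a second time --
    # exactly the duplicate-processing behavior described above.
```

And the Glue side would delete the message as its final step:

```python
import sys
import boto3
from awsglue.utils import getResolvedOptions

# "queue_url" and "receipt_handle" are the job arguments passed in by the Lambda above.
args = getResolvedOptions(sys.argv, ["queue_url", "receipt_handle"])

# ... process the new data for the given bucket/key here ...

# Delete the message only after processing has finished successfully.
boto3.client("sqs").delete_message(
    QueueUrl=args["queue_url"],
    ReceiptHandle=args["receipt_handle"],
)
```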
Some background on why I chose this flow:
Upvotes: 0
Views: 526
Reputation: 238051
I would look at Step Functions (SF), as they have an AWS Glue task that can run synchronously.
I think you could orchestrate your work using SF with the sync Glue task. In your state machine you would have a branch based on the Success or Failure of the Glue task, which would lead to either re-sending a new message to SQS or removing it from the queue.
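A rough sketch of what such a state machine could look like, expressed as a Python dict and deployed with boto3; the job name, role ARN, and state names are placeholders, and the success/failure states are just stubs for where you would delete or re-send the SQS message:

```python
import json
import boto3

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # .sync makes the state wait until the Glue job run finishes
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "my-pyspark-job",
                "Arguments": {
                    "--source_bucket.$": "$.bucket",
                    "--source_key.$": "$.key",
                },
            },
            "Catch": [
                {"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}
            ],
            "Next": "HandleSuccess",
        },
        # Success branch: e.g. delete the message from the queue here.
        "HandleSuccess": {"Type": "Succeed"},
        # Failure branch: e.g. re-send the message to SQS (sqs:sendMessage task).
        "HandleFailure": {"Type": "Fail", "Error": "GlueJobFailed"},
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="glue-ingest-orchestrator",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsGlueRole",  # placeholder
)
```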
Upvotes: 1