Balu Vyamajala

Reputation: 10333

AWS Step Functions: notification in between Retry attempts

In AWS Step Functions, let's say we have a Task with the Retry logic below. It will retry 6 times: after 10 seconds, 1 minute, 6 minutes, 36 minutes, 3.6 hours, and finally 21.6 hours. However, I would like to send a notification directly to SNS, or set an alarm, once 4 attempts have failed, so that someone can take action and resolve the backend issue. Is this possible with the Step Functions Retry? I looked at the CloudWatch logs of the state machine and the Lambda to see whether anything in them shows how long the task has been failing or how many attempts have been made, so that I could create filters, but I haven't found a good solution yet. Is there anything else I can try?

{
   "Type":"Task",
   "Resource":"${MyLambda}",
   "End":true,
   "Retry":[
      {
         "ErrorEquals":[
            "States.ALL"
         ],
         "IntervalSeconds":10,
         "MaxAttempts":6,
         "BackoffRate":6
      }
   ]
},
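For reference, the schedule quoted above follows from the Retry fields: each wait is IntervalSeconds multiplied by BackoffRate once for every retry that came before it. A minimal sketch of that arithmetic (the helper name retryDelays is made up for illustration, and it ignores any optional cap or jitter settings):

```javascript
// Hypothetical helper: lists the wait (in seconds) before each retry,
// using the IntervalSeconds * BackoffRate^n rule from the Retry block.
function retryDelays({ IntervalSeconds, MaxAttempts, BackoffRate }) {
  const delays = [];
  for (let retry = 0; retry < MaxAttempts; retry++) {
    delays.push(IntervalSeconds * Math.pow(BackoffRate, retry));
  }
  return delays;
}

// Values taken from the Retry block in the question.
const delays = retryDelays({ IntervalSeconds: 10, MaxAttempts: 6, BackoffRate: 6 });
console.log(delays); // [ 10, 60, 360, 2160, 12960, 77760 ]
```

That is 10 s, 1 min, 6 min, 36 min, 3.6 h and 21.6 h, so the wait that starts after the fourth failed attempt is the 36 minute one.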

Upvotes: 2

Views: 2294

Answers (2)

Vladislav Tupikin

Reputation: 124

It took me about 4 hours to find a solution to a problem similar to yours.

I needed to send an email notification after the first unsuccessful attempt. Here's what I did to make it work:

{
   "Type": "Task",
   "Resource": "${MyLambda}",
   "End": true,
   "Parameters": {
     "retryCount.$": "$$.State.RetryCount"
   },
   "Retry": [
      {
         "ErrorEquals": [
            "States.ALL"
         ],
         "IntervalSeconds": 10,
         "MaxAttempts": 6,
         "BackoffRate": 6
      }
   ]
},

Then, in your Lambda function, write something like this:

export const handler = (event) => {
  // retryCount is 0 on the first attempt, so this only runs on a retry
  if (event.retryCount) {
    // send email notification
  }
};
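For the question's case (alert only once four attempts have failed), the same idea can be extended with a threshold. A rough sketch, assuming the retryCount field from the Parameters above. NOTIFY_AFTER_RETRIES, shouldNotify, buildMessage and makeHandler are hypothetical names, and publish stands in for whatever SNS or email client you use:

```javascript
// Assumption: alert once, when the retry counter first reaches this value.
const NOTIFY_AFTER_RETRIES = 4;

// True only on the invocation where retryCount equals the threshold,
// so later retries do not send duplicate alerts.
function shouldNotify(event, threshold = NOTIFY_AFTER_RETRIES) {
  return Number(event.retryCount) === threshold;
}

// Builds the text of the alert from the retry counter.
function buildMessage(event) {
  return `Task is still failing after ${event.retryCount} retries`;
}

// `publish` is a placeholder for your notification client (for example a
// function that sends the text to an SNS topic); it is injected so the
// handler stays independent of any particular SDK.
function makeHandler(publish) {
  return async (event) => {
    if (shouldNotify(event)) {
      await publish(buildMessage(event));
    }
    // ... the task's real work goes here ...
  };
}
```

In the Lambda module you would export the result, for example makeHandler(sendToSns) with your own sendToSns function.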

I took this from the AWS docs: https://docs.aws.amazon.com/step-functions/latest/dg/input-output-contextobject.html

Upvotes: 3

odenS0n

Reputation: 189

I'm afraid the functionality you are looking for isn't offered in step function retry logic. I can think of two potential workarounds.

Option 1

Have a Lambda that gets triggered by error logs in CloudWatch from your step function Lambdas (you can create a subscription filter following this example). This Lambda gets all running executions of your step function and alerts if any have been running longer than a specified time.
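A Lambda subscribed to a log group receives the log events gzip-compressed and base64-encoded under event.awslogs.data. A minimal sketch of unpacking that payload in Node.js (the function name decodeLogEvents is an assumption; listing the running executions and sending the alert are left out):

```javascript
const zlib = require("zlib");

// Unpacks the payload CloudWatch Logs sends to a subscribed Lambda:
// base64 -> gunzip -> JSON with a `logEvents` array.
// Returns just the log message strings.
function decodeLogEvents(event) {
  const compressed = Buffer.from(event.awslogs.data, "base64");
  const payload = JSON.parse(zlib.gunzipSync(compressed).toString("utf8"));
  return payload.logEvents.map((logEvent) => logEvent.message);
}
```

The subscription filter pattern decides which log lines arrive here, so the function itself does not need to check for errors again.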

Option 2

In your step function Lambdas, wrap your error logs with the step function ARN and execution ID (one way to get these in your Lambda is through the context object). Have a separate Lambda that gets triggered by error logs in CloudWatch from your step function Lambdas. Using the step function ARN and execution ID, this Lambda can perform alerting based on how long the step function execution has been in a running state.
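One way to do the wrapping: pass the execution ID into the Lambda input from the Step Functions context object (for example "executionId.$": "$$.Execution.Id" in the task's Parameters), then include it in every error log. A sketch under that assumption; the executionId field name and the formatErrorLog helper are hypothetical:

```javascript
// Hypothetical helper: wraps an error in a single JSON log line that
// carries the execution ID passed in through the task's Parameters.
function formatErrorLog(event, error) {
  return JSON.stringify({
    level: "ERROR",
    executionId: event.executionId,
    message: error.message,
  });
}

// Inside the handler's catch block you would then write:
//   console.error(formatErrorLog(event, err));
```

Logging one JSON object per line keeps the execution ID easy to pick out in the Lambda that reads the logs.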

Example client calls (other clients should also offer similar methods)

  • Boto3 method you can use to get all running step function executions: list_executions with statusFilter='RUNNING'

*Sadly, the client methods to get step function executions seem to only return a start date (not a time). If you can create a naming standard for your step function executions, you can extrapolate the start time from the name of the execution itself. (This may also be a good way to avoid the error you get when trying to start a step function execution with a duplicate name.)
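If you go the naming-standard route, the alerting Lambda can recover the start time from the name. A sketch assuming a made-up convention where each execution name ends in a hyphen followed by the start time in epoch milliseconds (for example nightly-sync-1700000000000); startTimeFromName and runningTooLong are illustrative names:

```javascript
// Assumed naming convention: "<anything>-<13-digit epoch milliseconds>".
// Returns a Date, or null if the name does not follow the convention.
function startTimeFromName(name) {
  const match = /-(\d{13})$/.exec(name);
  return match ? new Date(Number(match[1])) : null;
}

// Returns the names of executions that started more than maxMinutes ago.
// Names that do not follow the convention are skipped.
function runningTooLong(names, maxMinutes, now = new Date()) {
  return names.filter((name) => {
    const started = startTimeFromName(name);
    return started !== null && now - started > maxMinutes * 60 * 1000;
  });
}
```

You would feed runningTooLong the names of the executions that are currently in a running state and alert on whatever it returns.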

Hope this helps!

Upvotes: 1
