Yuriy Bondaruk

Reputation: 4750

Retry StepFunction task based on error message

According to the AWS Step Functions documentation, retries can be configured per error name, but I'm wondering whether details from the error message can be used to define the retry strategy.

In my case I'm triggering a Glue ETL job that may fail with a custom exception, NoDataLoadedException, which I'd like to recognize and not retry. Here is my task definition (the first Retry block never takes effect):

"ExecuteEtl": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": "$.jobName",
    "Arguments.$": "$.jobArguments"
  },
  "Retry": [{
    "ErrorEquals": [ "NoDataLoadedException" ],
    "MaxAttempts": 0
  },{
    "ErrorEquals": [ "States.Timeout", "States.TaskFailed", "States.Runtime" ],
    "MaxAttempts": 4,
    "IntervalSeconds": 60,
    "BackoffRate": 2
  }],
  "Next": "ExtractGlueJobExecutionId"
}

Here is failure output:

{
  "resourceType": "glue",
  "resource": "startJobRun.sync",
  "error": "{\"AllocatedCapacity\":10,\"Arguments\":{},\"Attempt\":0,\"CompletedOn\":1549662956476,\"ErrorMessage\":\" NoDataLoadedException No data loaded from...",
  "cause": "States.TaskFailed"
}

Is it possible to use error.ErrorMessage to choose the retry strategy for the task?

Upvotes: 4

Views: 4364

Answers (3)

mananony

Reputation: 624

When a Glue job fails in Step Functions, the error name is not propagated to the Error field of the state's output (probably a missing feature in the Step Functions Glue integration). Instead, we get the generic States.TaskFailed, which makes it hard to distinguish between error types. However, it is possible to work around this by parsing the error and passing it to a wrapping state that can then act on the error type:

{
    "StartAt": "ParallelState",
    "States": {
        "ParallelState": {
            "Type": "Parallel",
            "End": true,
            "Retry": [
                {
                    "ErrorEquals": [
                        "RetryableException"
                    ],
                    "MaxAttempts": 2
                }
            ],
            "Branches": [
                {
                    "StartAt": "glue",
                    "States": {
                        "glue": {
                            "End": true,
                            "Type": "Task",
                            "InputPath": null,
                            "Catch": [
                                {
                                    "ErrorEquals": [
                                        "States.TaskFailed"
                                    ],
                                    "Next": "CatchAllFallback"
                                }
                            ],
                            "Resource": "arn:aws:states:::glue:startJobRun.sync",
                            "Parameters": {
                                "JobName": "MyGlueJob"
                            }
                        },
                        "CatchAllFallback": {
                            "Type": "Pass",
                            "Parameters": {
                                "Cause.$": "States.StringToJson($.Cause)"
                            },
                            "ResultPath": "$.parsedError",
                            "Next": "ChoiceFail"
                        },
                        "ChoiceFail": {
                            "Type": "Choice",
                            "Choices": [
                                {
                                    "Variable": "$.parsedError.Cause.ErrorMessage",
                                    "StringMatches": "RetryableException*",
                                    "Next": "failRetry"
                                }
                            ],
                            "Default": "fail"
                        },
                        "failRetry": {
                            "Type": "Fail",
                            "Error": "RetryableException"
                        },
                        "fail": {
                            "Type": "Fail"
                        }
                    }
                }
            ]
        }
    }
}

In essence, the Catch routes the failure to a Pass state that parses the error message and forwards it to a Choice state, which decides whether the exception is one we'd like to retry (or handle in some other way). In my case, I named that exception RetryableException. If we want to retry, we move to a Fail state whose Error field can be set to whatever name we like. If we don't want to retry, we move to a generic Fail state.

The surrounding Parallel state is used only for its "Retry" section, not for its parallel features (it has a single branch). It can match that error name in "ErrorEquals" and retry the whole block, or apply any other logic.

This workaround will hopefully become unnecessary once AWS properly propagates custom Glue job errors to the Error field of the Step Functions state.
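
To see what the CatchAllFallback and ChoiceFail states do with the caught error, the same parse-then-match logic can be sketched in Python. This is only an illustration, not something Step Functions runs: the function names string_matches and classify and the sample causes are made up here, and the wildcard handling only approximates StringMatches (every * is treated as a wildcard and escaping is not modeled).

```python
import json
import re


def string_matches(value, pattern):
    """Approximate StringMatches: each "*" matches any run of characters."""
    # Escape the literal parts and join them with ".*" in place of each "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def classify(cause, pattern="RetryableException*"):
    """Return the name of the Fail state the Choice state would route to."""
    parsed = json.loads(cause)  # the role of States.StringToJson($.Cause)
    # Simplification (assumption): a missing ErrorMessage counts as "no match".
    message = parsed.get("ErrorMessage", "")
    return "failRetry" if string_matches(message, pattern) else "fail"


# Sample causes below are invented for illustration.
retryable = json.dumps({"ErrorMessage": "RetryableException: try again"})
padded = json.dumps({"ErrorMessage": " NoDataLoadedException No data loaded"})

print(classify(retryable))                          # failRetry
print(classify(padded, "NoDataLoadedException*"))   # fail (leading space)
print(classify(padded, "*NoDataLoadedException*"))  # failRetry
```

Note that the ErrorMessage in the question's failure output starts with a space, so a pattern anchored at the start of the string may not match it; a leading * in the pattern is one way to allow for that.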

Upvotes: 1

Nico Arbar

Reputation: 162

I had the same issue.

I wanted to stop the Step Functions workflow if my Glue job didn't write any data to S3. What I ended up doing was adding a Lambda right after the Glue job to check whether files were written to the bucket under a specific timestamp partition. The Lambda returns a value (true or false, depending on whether there's data) to Step Functions, and a Choice state then branches the workflow on it.

"If_Data": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.input",
      "StringEquals": "True",
      "Next": "TableCrawler"
    }
  ],
  "Default": "FinishedNoData"
},
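
The Lambda itself is not shown here; a minimal Python sketch of such a check might look like the following. The event fields bucket and prefix and the helper name has_data are assumptions for illustration, and the result is returned under input as the string "True" or "False" only because that is what the Choice state above compares against.

```python
def has_data(s3_client, bucket, prefix):
    """Return True if at least one object exists under the given prefix."""
    # MaxKeys=1 is enough: we only need to know whether anything is there.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


def lambda_handler(event, context):
    # boto3 is imported here so has_data can be exercised without it.
    import boto3

    # "bucket" and "prefix" are hypothetical event fields.
    found = has_data(boto3.client("s3"), event["bucket"], event["prefix"])
    return {"input": "True" if found else "False"}
```

Passing the client into has_data keeps the S3 check separate from the handler, so it can be tried locally with a stub before wiring it into the state machine.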

Upvotes: 0

Milan Cermak

Reputation: 8074

Add the NoDataLoadedException error into a Catch block. In it, you can define the Next step. This should work:

"ExecuteEtl": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {
            "JobName.$": "$.jobName",
            "Arguments.$": "$.jobArguments"
        },
        "Retry" : [{
            "ErrorEquals": [ "States.Timeout", "States.TaskFailed", "States.Runtime" ],
            "MaxAttempts": 4,
            "IntervalSeconds": 60,
            "BackoffRate": 2
        }],
        "Catch": [{
            "ErrorEquals": [ "NoDataLoadedException" ],
            "Next": "NoDataStep"
        }],
        "Next": "ExtractGlueJobExecutionId"
    }

Because the NoDataLoadedException won't be handled by the Retry block, it will fall into the Catch, which is where you can react to it.

Upvotes: 1
