Reputation: 4750
According to the AWS Step Functions documentation, retries can be configured per error name, but is it possible to use details from the error message to define the retry strategy?
In my case I'm triggering a Glue ETL job which may fail with the custom exception NoDataLoadedException, so I'd like to recognize it and not retry. Here is my task definition (the first Retry block is never applied):
"ExecuteEtl": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName.$": "$.jobName",
"Arguments.$": "$.jobArguments"
},
"Retry" : [{
"ErrorEquals": [ "NoDataLoadedException" ],
"MaxAttempts": 0
},{
"ErrorEquals": [ "States.Timeout", "States.TaskFailed", "States.Runtime" ],
"MaxAttempts": 4,
"IntervalSeconds": 60,
"BackoffRate": 2
}],
"Next": "ExtractGlueJobExecutionId"
}
Here is the failure output:
{
"resourceType": "glue",
"resource": "startJobRun.sync",
"error": "{\"AllocatedCapacity\":10,\"Arguments\":{},\"Attempt\":0,\"CompletedOn\":1549662956476,\"ErrorMessage\":\" NoDataLoadedException No data loaded from...",
"cause": "States.TaskFailed"
}
Is it possible to use error.ErrorMessage to choose the retry strategy for the task?
Upvotes: 4
Views: 4364
Reputation: 624
When a Glue job fails in Step Functions, the error name is not propagated to the Error output of the state (probably a missing feature of the Step Functions Glue integration). Instead, we get the generic States.TaskFailed, which makes it impossible to easily distinguish between different error types. However, a workaround is possible: parse the error and pass it to a wrapping state that can then act depending on the error type.
{
"StartAt": "ParallelState",
"States": {
"ParallelState": {
"Type": "Parallel",
"End": true,
"Retry": [
{
"ErrorEquals": [
"RetryableException"
],
"MaxAttempts": 2
}
],
"Branches": [
{
"StartAt": "glue",
"States": {
"glue": {
"End": true,
"Type": "Task",
"InputPath": null,
"Catch": [
{
"ErrorEquals": [
"States.TaskFailed"
],
"Next": "CatchAllFallback"
}
],
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "MyGlueJob"
}
},
"CatchAllFallback": {
"Type": "Pass",
"Parameters": {
"Cause.$": "States.StringToJson($.Cause)"
},
"ResultPath": "$.parsedError",
"Next": "ChoiceFail"
},
"ChoiceFail": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.parsedError.Cause.ErrorMessage",
"StringMatches": "RetryableException*",
"Next": "failRetry"
}
],
"Default": "fail"
},
"failRetry": {
"Type": "Fail",
"Error": "RetryableException"
},
"fail": {
"Type": "Fail"
}
}
}
]
}
}
}
In essence, the Catch sends the failure to a Pass state that parses the error message and forwards it to a Choice state, which decides whether the exception is one we'd like to retry (or handle in some other way). In my case, I named that exception RetryableException. If we need to retry it, we move to a Fail state, which can set the Error field to whatever we wish. If we don't want to retry, we move to a generic Fail state.
The surrounding Parallel state, which is only useful for its "Retry" section and not its parallel features (we only have one branch), can then match that error in "ErrorEquals" and retry the whole block or apply any other logic.
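To make the decision logic easier to follow, here is a minimal Python sketch of what the Pass and Choice states do with the Cause string. It only mirrors the state machine for illustration and is not part of the workflow. The function name `is_retryable` and the sample messages are hypothetical.

```python
import json

# Mirrors the "RetryableException*" pattern used in the Choice state.
RETRYABLE_PREFIX = "RetryableException"


def is_retryable(cause: str) -> bool:
    """Decide whether a Glue failure should be retried, based on its Cause string."""
    parsed = json.loads(cause)                # same role as States.StringToJson($.Cause)
    message = parsed.get("ErrorMessage", "")  # field the Choice state inspects
    return message.startswith(RETRYABLE_PREFIX)


# Hypothetical Cause payloads, shaped like the failure output in the question.
retry_cause = json.dumps({"Attempt": 0, "ErrorMessage": "RetryableException table locked"})
other_cause = json.dumps({"Attempt": 0, "ErrorMessage": "NoDataLoadedException No data loaded"})
padded_cause = json.dumps({"Attempt": 0, "ErrorMessage": " RetryableException table locked"})

print(is_retryable(retry_cause))   # True
print(is_retryable(other_cause))   # False
print(is_retryable(padded_cause))  # False: the leading space defeats the prefix match
```

The last case is worth noting because the ErrorMessage in the question's failure output starts with a space. If your messages look like that, the pattern in the Choice state may need a leading wildcard.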
This workaround will hopefully become unnecessary once AWS properly propagates custom Glue job errors to the Error field of the Step Functions state.
Upvotes: 1
Reputation: 162
I had the same issue.
I wanted to stop the workflow in Step Functions if my Glue job didn't write any data to S3. What I ended up doing was creating a Lambda that runs right after the Glue job and checks whether files were written to the bucket under a specific timestamp partition. The Lambda returns a value (true or false, depending on whether there's data) to Step Functions, and a Choice state then changes the workflow accordingly.
"If_Data": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.input",
"StringEquals": "True",
"Next": "TableCrawler"
}
],
"Default": "FinishedNoData"
},
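The Lambda itself might look roughly like the sketch below. This is an assumption-laden outline, not the code I actually run: the event keys `bucket` and `prefix` are hypothetical, and the returned `input` field only lines up with the `$.input` variable above if the task's result becomes the state output.

```python
def prefix_has_objects(s3_client, bucket: str, prefix: str) -> bool:
    """Return True if at least one object exists under the given prefix."""
    # MaxKeys=1 is enough: we only need to know whether anything is there.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised without boto3 installed.
    import boto3

    # "bucket" and "prefix" are hypothetical event keys supplied by the state machine.
    has_data = prefix_has_objects(boto3.client("s3"), event["bucket"], event["prefix"])
    # The Choice state compares a string, so return "True" / "False" rather than a bool.
    return {"input": "True" if has_data else "False"}
```

Passing the client into `prefix_has_objects` keeps the S3 check easy to test with a stub.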
Upvotes: 0
Reputation: 8074
Add the NoDataLoadedException error to a Catch block. In it, you can define the Next step. This should work:
"ExecuteEtl": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName.$": "$.jobName",
"Arguments.$": "$.jobArguments"
},
"Retry" : [{
"ErrorEquals": [ "States.Timeout", "States.TaskFailed", "States.Runtime" ],
"MaxAttempts": 4,
"IntervalSeconds": 60,
"BackoffRate": 2
}],
"Catch": [{
"ErrorEquals": [ "NoDataLoadedException" ],
"Next": "NoDataStep"
}],
"Next": "ExtractGlueJobExecutionId"
}
Because NoDataLoadedException won't be handled by the Retry block, it will fall through to the Catch, which is where you can react to it.
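For a sense of how long the Retry block above keeps trying before giving up, here is a small sketch of the wait schedule. It assumes the usual rule that the first retry waits IntervalSeconds and each later wait is multiplied by BackoffRate, with no delay cap or jitter configured. The function name `retry_delays` is just for illustration.

```python
def retry_delays(interval_seconds: int, backoff_rate: int, max_attempts: int) -> list:
    """Seconds waited before each retry attempt, in order."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


delays = retry_delays(interval_seconds=60, backoff_rate=2, max_attempts=4)
print(delays)       # [60, 120, 240, 480]
print(sum(delays))  # 900 seconds of waiting in total, excluding job run time
```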
Upvotes: 1