Laura Galera
Laura Galera

Reputation: 119

Re-run failed glue job using a step function

I basically want to retry a glue job twice after status FAILED or TIMEOUT, before moving to the next stage. My state machine looks like this:

 {
  "Comment": "A description of my state machine",
  "StartAt": "Glue StartJobRun",
  "States": {
    "Glue StartJobRun": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun",
      "Parameters": {
        "JobName": "test-to-fail"
      },
      "End": true,
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "MaxAttempts": 2,
          "IntervalSeconds": 300,
          "BackoffRate": 1
        }
      ]
    }
  }
}

Nevertheless, the job runs one single time when I execute the state machine. Could it be that I misunderstood the logic behind Retry and it applies to the step function state and not the job? I followed this tutorial, which seems to confirm it is the job status. Then, what is wrong? And how can I achieve this?

Thank you!

Upvotes: 0

Views: 541

Answers (1)

Rohit Kumar
Rohit Kumar

Reputation: 1321

Actually you getting retry logic wrong

"Glue StartJobRun": { "Type": "Task", "Resource": "arn:aws:states:::glue:startJobRun",

This step will be considered failed if Glue job is not started. And it will retry to start it. As job was started, it is considered success.

Solution: But to check status of the job you should have next step with retry attempts.

  • configure if job failed retry

somehow like this..

{
  "Comment": "A state machine that starts a Glue job and checks its status",
  "StartAt": "Start Glue Job",
  "States": {
    "Start Glue Job": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "YourGlueJobName"
      },
      "Next": "Check Job Status"
    },
    "Check Job Status": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:getJobRun",
      "Parameters": {
        "JobName": "YourGlueJobName",
        "RunId.$": "$.RunId"
      },
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}

Upvotes: 1

Related Questions