dragonachu

Reputation: 551

How to include AWS Glue crawler in Step Function

My requirement: I have a crawler and a PySpark job in AWS Glue, and I need to set up a workflow for them using Step Functions.

Questions:

  1. How can I add the crawler as the first state? Which parameters do I need to provide (Resource, Type, etc.)?
  2. How do I make sure that the next state, the PySpark job, starts only after the crawler has run successfully?
  3. Is there any way to schedule the Step Functions state machine to run at a particular time?

Upvotes: 9

Views: 9321

Answers (2)

Naiara Cerqueira

Reputation: 21

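Here is the configuration you need. The definition is truncated: the ... at the end marks where the rest of your configuration (the Glue job state and anything after it) should continue. Note that getCrawler reports Crawler.State as READY, RUNNING, or STOPPING, so a crawler that has finished goes back to READY and reaches glue_job through the Default branch.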

{
      "StartAt": "crawler_name",
      "States": {
        "crawler_name": {
          "Type": "Task",
          "Parameters": {
            "Name": "crawler"
          },
          "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
          "Next": "crawler_info",
          "Retry": [
            {
              "ErrorEquals": [
                "States.ALL"
              ],
              "BackoffRate": 2,
              "IntervalSeconds": 10,
              "MaxAttempts": 2
            }
          ]
        },
        "crawler_info": {
          "Type": "Task",
          "Next": "crawler_status",
          "Parameters": {
            "Name": "crawler"
          },
          "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
          "Retry": [
            {
              "ErrorEquals": [
                "States.ALL"
              ],
              "BackoffRate": 2,
              "IntervalSeconds": 10,
              "MaxAttempts": 3
            }
          ]
        },
        "crawler_status": {
          "Type": "Choice",
          "Choices": [
            {
              "Variable": "$.Crawler.State",
              "StringEquals": "FAILED",
              "Next": "crawler_failed"
            },
            {
              "Variable": "$.Crawler.State",
              "StringEquals": "RUNNING",
              "Next": "crawler_finish_wait"
            },
            {
              "Variable": "$.Crawler.State",
              "StringEquals": "STOPPING",
              "Next": "crawler_finish_wait"
            },
            {
              "Variable": "$.Crawler.State",
              "StringEquals": "SUCCESS",
              "Next": "glue_job"
            }
          ],
          "Default": "glue_job"
        },
        "crawler_finish_wait": {
          "Type": "Wait",
          "Seconds": 10,
          "Next": "crawler_info"
        },
        "crawler_failed": {
          "Type": "Fail"
        },
        "glue_job": {
          "Type": "Task",
          ...
        }
        ...
}

To schedule it, use an EventBridge Scheduler schedule that targets the state machine :)
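
As a rough sketch of the scheduling step, the Python below uses boto3's EventBridge Scheduler client to start the state machine on a cron schedule. The schedule name, the cron expression, and both ARNs are placeholders rather than values from the question, and the IAM role is assumed to allow states:StartExecution on the state machine.

```python
import json


def build_schedule_request(name, cron, state_machine_arn, role_arn, payload=None):
    """Build the keyword arguments for EventBridge Scheduler's create_schedule."""
    return {
        "Name": name,
        "ScheduleExpression": f"cron({cron})",
        "ScheduleExpressionTimezone": "UTC",
        "FlexibleTimeWindow": {"Mode": "OFF"},  # fire at the scheduled time, no jitter
        "Target": {
            "Arn": state_machine_arn,  # the state machine to start
            "RoleArn": role_arn,       # role that Scheduler assumes to start it
            "Input": json.dumps(payload or {}),  # execution input as a JSON string
        },
    }


def create_schedule(request):
    """Send the request to AWS (needs boto3 and credentials)."""
    import boto3  # deferred so build_schedule_request has no AWS dependency

    return boto3.client("scheduler").create_schedule(**request)


# Example (placeholder names and ARNs): run every day at 06:00 UTC.
# create_schedule(build_schedule_request(
#     "run-glue-workflow",
#     "0 6 * * ? *",
#     "arn:aws:states:us-east-1:123456789012:stateMachine:glue-workflow",
#     "arn:aws:iam::123456789012:role/scheduler-start-execution",
# ))
```

Keeping the request builder separate from the AWS call makes it easy to inspect the payload before anything is created.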

Upvotes: 2

Frosty

Reputation: 698

I'm a few months late answering this, but it can be done from within the Step Function itself. You can create the following states:

  • TriggerCrawler (Task state): invokes a Lambda function that starts the AWS Glue crawler using any of the AWS SDKs.
  • PollCrawlerStatus (Task state): invokes a Lambda function that polls the crawler status and returns it in its response.
  • IsCrawlerRunSuccessful (Choice state): based on the crawler status, either moves on to the state that triggers your Glue job (once the crawler state is 'READY') or goes to the Wait state for a few seconds before polling again.
  • RunGlueJob (Task state): invokes a Lambda function that triggers the Glue job.
  • WaitForCrawler (Wait state): waits for 'n' seconds before the status is polled again.
  • Finish (Succeed state).
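
The Lambda side of these states could look roughly like the Python sketch below, using boto3's Glue client. Routing all three tasks through one handler with an "action" field in the event is an assumption made for brevity (the answer describes separate functions), and the event keys "crawler" and "job" are hypothetical names.

```python
def trigger_crawler(glue, crawler_name):
    """TriggerCrawler: start the crawler (raises if it is already running)."""
    glue.start_crawler(Name=crawler_name)
    return {"crawler": crawler_name}


def poll_crawler_status(glue, crawler_name):
    """PollCrawlerStatus: report the crawler state for the Choice state."""
    crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
    return {
        "crawler": crawler_name,
        # READY means the crawler is idle again, i.e. the run has finished.
        "state": crawler["State"],
        # Outcome of the most recent crawl, if there has been one.
        "last_status": crawler.get("LastCrawl", {}).get("Status"),
    }


def run_glue_job(glue, job_name):
    """RunGlueJob: start the Glue job and return its run id."""
    run = glue.start_job_run(JobName=job_name)
    return {"job_run_id": run["JobRunId"]}


def lambda_handler(event, context):
    """Dispatch on a hypothetical 'action' field set by each Task state."""
    import boto3  # deferred so the helpers can be tested with a stub client

    glue = boto3.client("glue")
    action = event["action"]
    if action == "trigger":
        return trigger_crawler(glue, event["crawler"])
    if action == "poll":
        return poll_crawler_status(glue, event["crawler"])
    if action == "run_job":
        return run_glue_job(glue, event["job"])
    raise ValueError(f"unknown action: {action}")
```

The Choice state would then branch on the returned "state" value: 'READY' goes on to RunGlueJob, anything else goes to WaitForCrawler.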

Here is what this Step Function looks like:

[Image: diagram of the Step Function workflow]

Upvotes: 5
