AJAX86

Reputation: 45

How to set the Glue Crawler RecrawlPolicy in my CF template

I would like to set my Glue crawler to crawl only new folders in my S3 bucket. Based on the documentation, it looks like I need to set RecrawlBehavior to CRAWL_NEW_FOLDERS_ONLY, but I can't find any guidance on how to do that in a CloudFormation template.

This is my crawler's configuration property now, but my use of RecrawlBehavior is invalid:

Configuration: "{\"Version\":1.0,\"RecrawlBehavior\":\"CRAWL_NEW_FOLDERS_ONLY\",\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

Upvotes: 2

Views: 2958

Answers (2)

Rodrigo Ibañez

Reputation: 159

It's supported now:

Crawler:
    Type: AWS::Glue::Crawler
    Properties:
      ...
      ...
      RecrawlPolicy:
        RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY

Also note that when CRAWL_NEW_FOLDERS_ONLY is set, the only schema change behaviour available is LOG, for both updates and deletes.
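Putting the two points together, a fuller resource might look like the sketch below. It is only an illustration, not a tested template: the crawler name, role ARN, database name, and S3 path are hypothetical placeholders, and it assumes the schema change policy is set to LOG for both behaviours as described above.

```yaml
Resources:
  Crawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: example-incremental-crawler                          # hypothetical name
      Role: arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole # hypothetical role ARN
      DatabaseName: example_db                                    # hypothetical database
      Targets:
        S3Targets:
          - Path: s3://example-bucket/data/                       # hypothetical bucket path
      # Crawl only folders added since the last crawl
      RecrawlPolicy:
        RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY
      # Incremental crawls only allow LOG for update and delete
      SchemaChangePolicy:
        UpdateBehavior: LOG
        DeleteBehavior: LOG
```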

Upvotes: 0

nikoo28

Reputation: 2971

As per my understanding, the incremental crawl policy is a relatively new feature in Glue and is not supported in CloudFormation yet.

A workaround I can suggest is to create the crawler using CloudFormation and then use the AWS CLI to update its RecrawlPolicy property.

When you create a crawler using CloudFormation and retrieve its properties using the CLI, "RecrawlPolicy" has "RecrawlBehavior" set to "CRAWL_EVERYTHING". You can use the command below to change it to incremental crawls (crawl new folders only).

aws glue update-crawler \
    --name <crawlername> \
    --recrawl-policy '{"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"}' \
    --schema-change-policy '{"UpdateBehavior":"LOG","DeleteBehavior":"LOG"}'

Upvotes: 3
