Reputation: 45
I would like to set my Glue crawler to crawl only new folders in my S3 bucket. Based on the documentation, it looks like I want to set RecrawlBehavior to CRAWL_NEW_FOLDERS_ONLY, but I can't find any guidance on how to do that in a CloudFormation template.
This is my crawler's configuration property now, but my use of RecrawlBehavior is invalid:
Configuration: "{\"Version\":1.0,\"RecrawlBehavior\":\"CRAWL_NEW_FOLDERS_ONLY\",\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
Upvotes: 2
Views: 2958
Reputation: 159
It's supported now:
Crawler:
  Type: AWS::Glue::Crawler
  Properties:
    ...
    ...
    RecrawlPolicy:
      RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY
Also note that if CRAWL_NEW_FOLDERS_ONLY is set, the only schema change behavior available for updates and deletes is LOG.
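To show how the two settings fit together, here is a minimal sketch of a fuller resource definition. The crawler name, database name, S3 path, and the `GlueCrawlerRole` resource are hypothetical placeholders that are not part of the original answer; substitute your own values.

```yaml
Crawler:
  Type: AWS::Glue::Crawler
  Properties:
    Name: my-incremental-crawler            # placeholder name (assumption)
    Role: !GetAtt GlueCrawlerRole.Arn       # assumes an IAM role resource named GlueCrawlerRole in the same template
    DatabaseName: my_database               # placeholder database (assumption)
    Targets:
      S3Targets:
        - Path: s3://my-bucket/my-prefix/   # placeholder S3 path (assumption)
    # Crawl only folders added since the last crawler run
    RecrawlPolicy:
      RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY
    # With CRAWL_NEW_FOLDERS_ONLY, both behaviors are set to LOG, as noted above
    SchemaChangePolicy:
      UpdateBehavior: LOG
      DeleteBehavior: LOG
```

RecrawlPolicy and SchemaChangePolicy are set side by side here so the LOG requirement is satisfied in the same deployment.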
Upvotes: 0
Reputation: 2971
As far as I understand, the incremental crawl policy is a relatively new feature in Glue and is not supported in CloudFormation yet.
A workaround for this limitation is to create the crawler with CloudFormation and then use the AWS CLI to update its RecrawlPolicy property.
When you create a crawler with CloudFormation and then retrieve its properties with the CLI, "RecrawlPolicy" has "RecrawlBehavior" set to "CRAWL_EVERYTHING". You can use the command below to switch it to incremental crawls (crawl new folders only).
aws glue update-crawler \
  --name <crawlername> \
  --recrawl-policy '{"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"}' \
  --schema-change-policy '{"UpdateBehavior":"LOG","DeleteBehavior":"LOG"}'
Upvotes: 3