Reputation: 601
I have an AWS Glue crawler that looks at a specific S3 location containing Avro files. I have a process which outputs files into a new subfolder of that location.
Once I manually run the crawler, the new subfolder is picked up as a new table in the database, and it also becomes queryable from Athena.
Is there a way I can automate the process and call the crawler programmatically, specifying only that new subfolder, so that it doesn't have to scan the entire parent folder structure? I want to add tables to a database, not partitions to an existing table.
I was looking for a Python option, and I have seen that one can do:
import boto3
glue_client = boto3.client('glue', region_name='us-east-1')
glue_client.start_crawler(Name='avro-crawler')
I haven't seen an option to pass a folder to limit where the crawler looks. Because there are hundreds of folders/tables in that location, re-crawling everything takes a long time, which I'm trying to avoid.
What are my options here? Would I need to programmatically create a new crawler for each new subfolder added to S3?
Or create a Lambda function which gets triggered when a new subfolder is added to S3? I've seen an answer here, but even with Lambda, it still implies I call start_crawler, which would crawl everything?
Thanks for any suggestions.
Upvotes: 1
Views: 8336
Reputation: 86
Set crawler_name to your crawler's name and update_path to the subfolder you want to crawl, then update the crawler's targets before starting it:
response = glue_client.update_crawler(
    Name=crawler_name,
    Targets={'S3Targets': [{'Path': update_path}]}
)
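For example, wired into an S3-triggered Lambda this could look like the sketch below. The crawler name, bucket layout, and event parsing are assumptions for illustration; note that update_crawler raises CrawlerRunningException if the crawler is currently running, so the crawler needs to be idle when the event fires.

```python
# Sketch: retarget an existing Glue crawler at just the new subfolder,
# then start it.  "avro-crawler" and the S3 layout below are placeholders.

def subfolder_path(bucket, key, base_prefix=""):
    """Derive the s3:// prefix of the subfolder an object landed in,
    relative to the crawler's base prefix (assumed layout)."""
    rest = key[len(base_prefix):]
    subfolder = rest.split("/")[0]
    return f"s3://{bucket}/{base_prefix}{subfolder}/"

def handler(event, context):
    """Lambda entry point for an S3 ObjectCreated trigger."""
    import boto3  # imported here so subfolder_path stays dependency-free

    record = event["Records"][0]["s3"]
    path = subfolder_path(
        record["bucket"]["name"],
        record["object"]["key"],
        base_prefix="parent/",  # placeholder: the folder your crawler watches
    )

    glue_client = boto3.client("glue", region_name="us-east-1")
    # Replace the crawler's target list with only the new subfolder...
    glue_client.update_crawler(
        Name="avro-crawler",
        Targets={"S3Targets": [{"Path": path}]},
    )
    # ...then start it; only that path is scanned.
    glue_client.start_crawler(Name="avro-crawler")
```

Since the targets are replaced each time, only the new subfolder is crawled, and the resulting table is added to whatever database the crawler is already configured to write to.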
Upvotes: 4