cristi.calugaru

Reputation: 601

Boto3 start glue crawler with new s3 input

I have an AWS Glue crawler that looks at a specific S3 location containing Avro files. I have a process that outputs files into a new subfolder of that location.

Once I manually run the crawler, the new subfolder is picked up as a new table in the database, and it also becomes queryable from Athena.

Is there a way I can automate the process and call the crawler programmatically, but only specifying that new subfolder, so that it doesn't have to scan the entire parent folder structure? I want to add tables to a database, not partitions to an existing table.

I was looking for a Python option, and I have indeed seen that one can do:

import boto3
glue_client = boto3.client('glue', region_name='us-east-1')
glue_client.start_crawler(Name='avro-crawler')

I haven't seen an option to pass a folder to limit where the crawler looks. Because there are hundreds of folders/tables in that location, re-crawling everything takes a long time, which I'm trying to avoid.

What are my options here? Would I need to programmatically create a new crawler for each new subfolder added to S3?

Or create a Lambda function that gets triggered when a new subfolder is added to S3? I've seen an answer here, but even with Lambda, it still implies I call start_crawler, which would crawl everything?
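
To be concrete, something like the following is roughly what I had in mind (the handler name and event handling are just a sketch), but as far as I can tell the start_crawler call would still scan the whole parent location:

import boto3

glue_client = boto3.client('glue', region_name='us-east-1')

def handler(event, context):
    # Triggered by an s3:ObjectCreated:* event on the bucket.
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    print(f'New object: s3://{bucket}/{key}')

    # This starts the crawler, but it still scans the entire configured path.
    glue_client.start_crawler(Name='avro-crawler')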

Thanks for any suggestions.

Upvotes: 1

Views: 8336

Answers (1)

Kishore

Reputation: 86

Set crawler_name to your crawler's name and update_path to the S3 path you want it to crawl:

response = glue_client.update_crawler(
    Name=crawler_name,
    Targets={'S3Targets': [{'Path': update_path}]}
)
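
For example, a minimal end-to-end sketch (the crawler name and S3 path below are placeholders) would be to retarget the crawler at only the new subfolder and then start it:

import boto3

glue_client = boto3.client('glue', region_name='us-east-1')

crawler_name = 'avro-crawler'                             # placeholder crawler name
update_path = 's3://my-bucket/avro-data/new-subfolder/'   # placeholder new subfolder

# Point the crawler at only the new subfolder. Note that update_crawler
# will fail if the crawler is currently running, so call it between runs.
glue_client.update_crawler(
    Name=crawler_name,
    Targets={'S3Targets': [{'Path': update_path}]}
)

# The crawler now scans only update_path when it is started.
glue_client.start_crawler(Name=crawler_name)

The crawler still writes into the same Glue database it was configured with, so each new subfolder shows up as a new table without re-scanning the existing folders.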

Upvotes: 4
