Hary S

Reputation: 140

Should I run the Glue crawler every time to fetch the latest data?

I have an S3 bucket named Employee. Every three hours I receive a file in the bucket with a timestamp attached to its name. I will be using a Glue job to move the file from S3 to Redshift with some transformations. The input files in the S3 bucket have a fixed structure. The Glue job uses the table created in the Data Catalog via the crawler as its input.

First run:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "employee_623215", transformation_ctx = "datasource0")
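The rest of the job looks roughly like this (the column mappings, the Redshift connection name and the temp dir below are placeholders):

# Sketch of the rest of the job; mappings, connection name and temp dir are placeholders.
from awsglue.transforms import ApplyMapping

applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[("id", "long", "id", "long"), ("name", "string", "name", "string")],
    transformation_ctx="applymapping1")

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=applymapping1,
    catalog_connection="redshift-connection",        # placeholder Glue connection
    connection_options={"dbtable": "employee", "database": "dev"},
    redshift_tmp_dir="s3://employee/temp/",          # placeholder temp dir
    transformation_ctx="datasink1")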

After three hours, when I get another employee file, should I crawl it again?

Is there a way to have a single table in the Data Catalog, e.g. employee, that gets updated with the latest S3 file and can be used by the Glue job for processing? Or should I run the crawler every time to get the latest data? The issue with that is that more and more tables will be created in my Data Catalog.

Please let me know if this is possible.

Upvotes: 3

Views: 7685

Answers (2)

Dennis Traub

Reputation: 51654

You only need to run the AWS Glue Crawler again if the schema changes. As long as the schema remains unchanged, you can just add files to Amazon S3 without having to re-run the Crawler.

Update: @Eman's comment below is correct

If you are reading from the catalog, this suggestion will not work. Partitions will not be updated in the catalog table if you do not re-crawl. Running the crawler maps those new partitions to the table and allows you to process the next day's partitions.
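If re-crawling is needed, the crawler can be started on a schedule, from a Glue trigger, or from code before the job runs. A minimal sketch with boto3 (the crawler name is a placeholder):

import boto3

# Minimal sketch: start the crawler before the job so new partitions are catalogued.
# "employee_crawler" is a placeholder crawler name.
glue = boto3.client("glue")
glue.start_crawler(Name="employee_crawler")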

Upvotes: 4

Shubham Jain

Reputation: 5536

An alternative approach is, instead of reading from the catalog, to read directly from S3 and process the data in the Glue job.

This way you don't need to run the crawler again.

Use

from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")

Documented here
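A minimal sketch of reading the files directly from S3 (the bucket path and format options are assumptions based on the question):

# Sketch: read directly from S3 instead of the catalog.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://employee/"], "recurse": True},  # assumed bucket path
    format="csv",                                                        # assumed file format
    format_options={"withHeader": True},
    transformation_ctx="datasource0")

With job bookmarks enabled and a transformation_ctx set, each run only picks up files that have not been processed yet.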

Upvotes: 2
