Yuriy Bondaruk

Reputation: 4750

Is it required to run AWS Glue crawler to detect new data before executing an ETL job?

The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating or updating the database and corresponding tables).

However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects on S3, new rows in a DB table) if we know that there are no schema/partitioning changes.

So, is it required to run a crawler before running an ETL job so that the job can pick up new data?

Upvotes: 11

Views: 7236

Answers (2)

Ricardo Mayerhofer

Reputation: 2309

It's necessary to run the crawler prior to the job.

The crawler takes the place of Athena's MSCK REPAIR TABLE and also updates the table with new columns as they're added.

Upvotes: 0

RobinL

Reputation: 11577

AWS Glue will automatically detect new data in S3 buckets so long as it's within your existing folders (partitions).

If data is added to new folders (partitions), you need to reload your partitions by running: MSCK REPAIR TABLE mytable;
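If you'd rather trigger that partition reload from code than from the Athena console, here is a minimal sketch using boto3's Athena client. The helper name `repair_partitions` and the database, table, and S3 output location are hypothetical placeholders, not anything from the question. Note that the query runs asynchronously, so the call returns as soon as the query is submitted, and that MSCK REPAIR TABLE only discovers Hive-style partition folders (e.g. `year=2018/month=03/`).

```python
def repair_partitions(athena_client, database, table, output_location):
    """Submit MSCK REPAIR TABLE for `table` and return the query execution ID.

    `athena_client` is expected to be a boto3 Athena client, e.g.
    boto3.client("athena"). All other arguments are placeholders you
    supply yourself (hypothetical names, not taken from the question).
    """
    response = athena_client.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table}",
        QueryExecutionContext={"Database": database},
        # Athena requires an S3 location for query results.
        ResultConfiguration={"OutputLocation": output_location},
    )
    # The query is only submitted here; it finishes in the background.
    return response["QueryExecutionId"]


# Example usage (needs AWS credentials, so it is left commented out):
# import boto3
# athena = boto3.client("athena")
# repair_partitions(athena, "mydb", "mytable", "s3://my-bucket/athena-results/")
```

You would run something like this after writing to a new partition folder and before the ETL job starts, assuming the job reads the table through the Data Catalog.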

Upvotes: 6
