Maor Shmueli

Reputation: 63

AWS Glue Crawler poor performance

We're having performance issues with one of our Glue crawlers.

Crawler details:

- source: S3 path
- output: Glue table
- mode: incremental
- schedule: daily
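
For reference, the setup above could be sketched roughly as the following boto3 parameters. This is only an illustration: the crawler name, role ARN, database name, and cron expression are hypothetical placeholders, and the `LOG` schema change policy is included on the assumption that incremental (new-folders-only) crawls require it.

```python
def build_crawler_config(name, role_arn, database, s3_path):
    """Return keyword arguments shaped for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # source: a single S3 path
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # schedule: daily (the exact time of day is an assumption)
        "Schedule": "cron(0 1 * * ? *)",
        # mode: incremental, i.e. only crawl folders added since the last run
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # assumed to be required alongside incremental crawls
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }


# All identifiers below are placeholders, not our real resources.
config = build_crawler_config(
    name="daily-partition-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="analytics",
    s3_path="s3://<bucket>/<directory>/",
)
print(config["RecrawlPolicy"])
```

With credentials configured, a dict like this could be passed as `boto3.client("glue").create_crawler(**config)`.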

We are using the crawler to load the new daily partitions and update the Glue catalog.
The S3 path is as follows: s3://<bucket>/<directory>/
with String partitions:

date
   hour
      customer_id (~800 per hour)

which means that each hour adds about 800 new partitions (one per customer),
so over 24 hours that totals 800 * 24 = 19,200 partitions per day.
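
To make the layout and the daily volume concrete, here is a small sketch. It assumes Hive-style `key=value` folder names, which is a guess for illustration; the date and customer ID are made up.

```python
from datetime import date

CUSTOMERS_PER_HOUR = 800  # approximate figure from the layout above
HOURS_PER_DAY = 24


def partition_prefix(base, day, hour, customer_id):
    """Build one partition prefix (Hive-style naming is an assumption)."""
    return f"{base}date={day.isoformat()}/hour={hour:02d}/customer_id={customer_id}/"


def daily_partition_count(customers_per_hour=CUSTOMERS_PER_HOUR, hours=HOURS_PER_DAY):
    """Number of leaf partitions added per day."""
    return customers_per_hour * hours


example = partition_prefix("s3://<bucket>/<directory>/", date(2021, 1, 1), 0, "c0001")
print(example)                  # s3://<bucket>/<directory>/date=2021-01-01/hour=00/customer_id=c0001/
print(daily_partition_count())  # 19200
```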

The crawler worked perfectly for a while, but lately the execution duration has increased to ~14 hours and the crawler eventually fails with:

ERROR : Internal Service Exception.

Currently, the Glue table holds 3,384,101 partitions.
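
As a rough sanity check, and assuming the rate of about 800 partitions per hour has been constant, that total corresponds to roughly half a year of data:

```python
TOTAL_PARTITIONS = 3_384_101   # current partition count of the Glue table
PARTITIONS_PER_DAY = 800 * 24  # assumes a constant ~800 customers per hour

days_of_data = TOTAL_PARTITIONS / PARTITIONS_PER_DAY
print(f"{days_of_data:.1f} days")  # 176.3 days
```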

Even crawling a small number of partitions takes forever and eventually fails.

I believe this happens because of the high number of partitions we are using.
Is there any way to improve the crawling performance
so that the crawler can handle the large number of partitions added on a daily basis?

Thanks

Upvotes: 1

Views: 871
