Smart sampling with AWS Glue Crawlers

Question

I have a couple of tables in my S3 bucket. The tables are large both in total size and in number of files; they are stored as JSON (suboptimal, I know) and have a lot of partitions.

Now I want to enable the AWS Glue Data Catalog and AWS Glue Crawlers, but I am terrified by the cost of the crawlers going through all of the data.

The schema doesn't change often, so it should not be necessary to go through all of the files on S3.

Will the Crawlers go through all the files by default? Is it possible to configure a smarter sampling strategy that would look inside just some of the files instead of all of them?

Answers (1)
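
If no sample size is set, a crawler reads every file under the S3 target. The S3 target does accept a `SampleSize` setting (an integer from 1 to 249) that limits how many files are read in each leaf folder, which is probably the closest thing to the sampling asked about. Below is a minimal sketch with boto3, not a definitive setup: the helper names `build_crawler_request` and `create_sampling_crawler`, the default of 10 files, and all argument values are assumptions made for illustration.

```python
def build_crawler_request(name, role_arn, database, s3_path, sample_size=10):
    """Build keyword arguments for boto3's glue.create_crawler.

    Hypothetical helper: only SampleSize is the point here, the rest is
    the minimum a crawler definition needs.
    """
    # Glue documents SampleSize as an integer between 1 and 249.
    if not 1 <= sample_size <= 249:
        raise ValueError("sample_size must be between 1 and 249")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                # SampleSize = number of files read per leaf folder.
                {"Path": s3_path, "SampleSize": sample_size}
            ]
        },
    }


def create_sampling_crawler(**kwargs):
    """Create the crawler; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without AWS access

    glue = boto3.client("glue")
    return glue.create_crawler(**build_crawler_request(**kwargs))
```

Calling `create_sampling_crawler(name=..., role_arn=..., database=..., s3_path=...)` with your own values would register the crawler. A small sample size keeps the crawl short, at the risk of missing fields that only appear in files the crawler never opens.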