Reputation: 3029
I have a couple of tables in my S3 bucket. The tables are big, both in total size and in the number of files; they are stored as JSON (suboptimal, I know) and have a lot of partitions.
Now I want to enable AWS Glue Data Catalog and AWS Glue Crawlers. However, I am terrified by the price of the crawlers going through all of the data.
The schema doesn't change often so it is not necessary to go through all of the files on S3.
Will the Crawlers go through all the files by default? Is it possible to configure a smarter sampling strategy that would look inside just some of the files instead of all of them?
Upvotes: 0
Views: 933
Reputation: 851
Depending on your bucket structure, you could use exclude paths and point the crawlers only at the specific prefixes you want crawled. If the data uses Hive-style partitioning, you can use Athena to run MSCK REPAIR TABLE to add the partitions. Alternatively, you can create the tables manually in Athena and then run MSCK REPAIR TABLE, although that is bound to take a very long time if you have too many partitions and the files are as huge as you mentioned.
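To make that concrete, here is a minimal sketch in Python of what the two requests could look like with boto3. The crawler name, role ARN, database, bucket paths and exclude patterns are all hypothetical placeholders. The `SampleSize` field on the S3 target is, to my knowledge, Glue's option for crawling only a few files per leaf folder (an integer from 1 to 249), which is close to the sampling asked about in the question; please check it against the current Glue documentation before relying on it. The builder functions only assemble the request dictionaries, so nothing is sent to AWS until `submit` is called.

```python
def build_crawler_request(name, role_arn, database, s3_path,
                          exclusions=(), sample_size=None):
    """Build keyword arguments for glue.create_crawler (no AWS call here)."""
    target = {"Path": s3_path}
    if exclusions:
        # Glob patterns for objects the crawler should skip.
        target["Exclusions"] = list(exclusions)
    if sample_size is not None:
        # Assumption: Glue accepts 1..249 files per leaf folder.
        if not 1 <= sample_size <= 249:
            raise ValueError("sample_size must be between 1 and 249")
        target["SampleSize"] = sample_size
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [target]},
    }


def build_repair_request(database, table, output_location):
    """Build keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": f"MSCK REPAIR TABLE {table}",
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit(crawler_request, repair_request):
    """Send both requests; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builders work without boto3

    boto3.client("glue").create_crawler(**crawler_request)
    boto3.client("athena").start_query_execution(**repair_request)


# Hypothetical values, for illustration only.
crawler_request = build_crawler_request(
    name="events-crawler",
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",
    database="analytics",
    s3_path="s3://example-bucket/events/",
    exclusions=["**/archive/**"],
    sample_size=5,
)
repair_request = build_repair_request(
    database="analytics",
    table="events",
    output_location="s3://example-bucket/athena-results/",
)
```

Calling `submit(crawler_request, repair_request)` would create the crawler and start the repair query. The crawler still has to be started separately, and the repair query only helps once the table exists and the prefixes follow the Hive `key=value` layout.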
Upvotes: 1