Reputation: 1335

AWS Glue does not detect partitions and creates 1000+ tables in catalog

I am using AWS Glue to create metadata tables.

AWS Glue Crawler data store path: s3://bucket-name/

Bucket structure in S3 is like

├── bucket-name        
│   ├── pt=2011-10-11-01     
│   │   ├── file1                    
│   │   ├── file2
│   ├── pt=2011-10-11-02               
│   │   ├── file1          
│   ├── pt=2011-10-10-01           
│   │   ├── file1           
│   ├── pt=2011-10-11-10              
│   │   ├── file1

For this structure, the AWS Glue crawler creates 4 tables, one per partition folder.

My question is: why does the AWS Glue crawler not detect the partitions?

Upvotes: 10

Answers (5)

Asclepius

Reputation: 63516

There are two things I needed to do to get AWS Glue to avoid creating extraneous tables. This was tested with boto3 1.17.46.

Firstly, ensure an S3 object structure such as this:

s3://mybucket/myprefix/mytable1/<nested_partition>/<name>.xyz
s3://mybucket/myprefix/mytable2/<nested_partition>/<name>.xyz
s3://mybucket/myprefix/mytable3/<nested_partition>/<name>.xyz

Secondly, if using boto3, create the crawler with the arguments:

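import json
import boto3

targets = [{"Path": f"s3://mybucket/myprefix/mytable{i}/"} for i in (1, 2, 3)]
config = {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}

# Name and Role are also required by create_crawler; the values below are placeholders.
boto3.client("glue").create_crawler(Name="mycrawler", Role="MyGlueCrawlerRole", Targets={"S3Targets": targets}, Configuration=json.dumps(config))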

As per Targets, each table's path is provided as a list to the crawler.
As per Configuration, all files under each provided path should be merged into a single schema.

If using something other than boto3, it should be straightforward to provide the aforementioned arguments similarly.

Upvotes: 1

bhrd

Reputation: 81

To force Glue to merge multiple schemas, make sure the option "Create a single schema for each S3 path" is checked when creating the crawler.

Screenshot of crawler creation step, with this setting enabled

Here is a detailed explanation, quoted directly from the AWS documentation (reference):

By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors taken into account include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.

You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.

If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.
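For a crawler that already exists, the same setting can presumably be applied through the API instead of the console. A minimal sketch, assuming boto3; the helper name enable_combined_schemas and the crawler name "my-crawler" are placeholders, and the Glue client is passed in as an argument so the function can be tried without AWS credentials.

```python
import json


def grouping_configuration():
    """Return the crawler Configuration JSON that enables CombineCompatibleSchemas."""
    config = {
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return json.dumps(config)


def enable_combined_schemas(glue_client, crawler_name):
    """Apply the grouping policy to an existing crawler (hypothetical helper)."""
    # update_crawler expects the configuration as a JSON string.
    return glue_client.update_crawler(
        Name=crawler_name,
        Configuration=grouping_configuration(),
    )
```

Usage would look like enable_combined_schemas(boto3.client("glue"), "my-crawler"), followed by a re-run of the crawler.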

Upvotes: 7

hfaouaz

Reputation: 31

You need to crawl a parent folder with all partitions under it; otherwise the crawler will treat each partition as a separate table. For example, lay out the data like this:

s3://bucket/table/part=1
s3://bucket/table/part=2
s3://bucket/table/part=3

then crawl s3://bucket/table/
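To illustrate why the parent folder matters, here is a rough sketch of how Hive-style name=value folders under a table root can be read as partition columns. This is not Glue's actual algorithm, only an illustration of the layout; the function name split_partitions is made up for this example.

```python
def split_partitions(key, table_root):
    """Split an S3 key under table_root into (partition dict, file name).

    Illustration only: folder names of the form name=value are treated as
    partition columns, in the order they appear in the path.
    """
    if not key.startswith(table_root):
        raise ValueError(f"{key!r} is not under {table_root!r}")
    *folders, filename = key[len(table_root):].split("/")
    partitions = {}
    for folder in folders:
        name, sep, value = folder.partition("=")
        if not sep:
            raise ValueError(f"{folder!r} is not a name=value folder")
        partitions[name] = value
    return partitions, filename


keys = ["table/part=1/file1", "table/part=2/file1", "table/part=3/file1"]
for key in keys:
    print(split_partitions(key, "table/"))
```

With the table folder as the root, every key resolves to the same table with a different value of the part column, which is the shape the crawler needs to see.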

Upvotes: 3

Alexey Bakulin

Reputation: 1369

Try using a table path like s3://bucket-name/<table_name>/pt=<date_time>/file. If the crawler still treats every partition as a separate table after that, create the table manually and re-run the crawler to pick up the partitions.
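If the data currently sits directly under the bucket root, as in the question, the objects would first have to move under a table folder. A minimal sketch of planning that move; plan_move and the table name "mytable" are hypothetical, and the actual copy (for example with boto3's copy_object followed by delete_object) is left out.

```python
def plan_move(keys, table_name):
    """Map each existing key to a new key under the table folder.

    Only keys that start with a pt=... partition folder are moved.
    """
    moves = {}
    for key in keys:
        if key.startswith("pt="):
            moves[key] = f"{table_name}/{key}"
    return moves


existing = [
    "pt=2011-10-11-01/file1",
    "pt=2011-10-11-01/file2",
    "pt=2011-10-11-02/file1",
]
for old, new in plan_move(existing, "mytable").items():
    print(old, "->", new)
```

After the move, the crawler would be pointed at s3://bucket-name/mytable/ instead of the bucket root.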

Upvotes: 0

iammehrabalam

Reputation: 1335

The answer is:

Before merging schemas, the AWS Glue crawler first computes a similarity index between them. If the similarity index is more than 70%, the schemas are merged; otherwise a new table is created.
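How Glue computes this index is not described here, so the following is only a rough illustration of the idea, using Jaccard similarity over (column name, type) pairs as a stand-in metric. The 70% threshold is the figure mentioned above; the function names are made up.

```python
def schema_similarity(schema_a, schema_b):
    """Jaccard similarity of two schemas given as {column: type} dicts."""
    cols_a = set(schema_a.items())
    cols_b = set(schema_b.items())
    if not cols_a and not cols_b:
        return 1.0
    return len(cols_a & cols_b) / len(cols_a | cols_b)


def should_merge(schema_a, schema_b, threshold=0.7):
    """Merge when the similarity is above the threshold (assumed 70%)."""
    return schema_similarity(schema_a, schema_b) > threshold


a = {"id": "int", "name": "string", "ts": "timestamp", "city": "string"}
b = {"id": "int", "name": "string", "ts": "timestamp", "city": "string", "zip": "string"}
print(schema_similarity(a, b), should_merge(a, b))
```

Under this stand-in metric, files whose columns mostly overlap would land in one table, while files with largely different columns would each get their own.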

Upvotes: 1
