Reputation: 1615

AWS Glue Crawler: want separate table for folder in s3

My s3 file structure is:

├── bucket
│   ├── customer_1
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├── sometype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   │   ├── month=12
│   │   │   │   ├── sometype-2017-12-01.parquet
│   │   │   │   ├── sometype-2017-12-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=2018
│   │   │   ├── month=01
│   │   │   │   ├── sometype-2018-01-01.parquet
│   │   │   │   ├── sometype-2018-01-02.parquet
│   │   │   │   ├── ...
│   ├── customer_2
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── moretype-2017-11-01.parquet
│   │   │   │   ├── moretype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=...

I want to create separate tables for customer_1 and customer_2 with an AWS Glue crawler. It works if I specify the paths s3://bucket/customer_1 and s3://bucket/customer_2.

I've tried s3://bucket/customer_* and s3://bucket/*, but neither works and no table is created in the Glue catalog.

Upvotes: 4

Answers (2)

Sandeep Singh

Reputation: 508

I faced this issue myself recently. AWS Glue crawlers have an option called "Grouping behavior for S3 data". If the checkbox is not selected, the crawler will try to combine schemas. By selecting the checkbox you can ensure that multiple, separate tables are created.

The table level should be the depth, counted from the root of the bucket, at which you want separate tables.

In your case the depth would be 2.
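For anyone setting this up outside the console, here is a minimal sketch of the same idea with boto3. It assumes the table level is passed through the crawler's `Configuration` JSON as `Grouping.TableLevelConfiguration`; the crawler name, role ARN, and database name are placeholders I made up, not values from the question.

```python
import json


def build_crawler_request(name, role_arn, database, s3_path, table_level):
    """Build keyword arguments for the Glue create_crawler call."""
    # The bucket itself counts as level 1, so s3://bucket/customer_1 is level 2.
    configuration = {
        "Version": 1.0,
        "Grouping": {"TableLevelConfiguration": table_level},
    }
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Configuration": json.dumps(configuration),
    }


def create_crawler(request):
    """Send the request to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)


# All names below are hypothetical placeholders.
request = build_crawler_request(
    name="customers-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="customers_db",
    s3_path="s3://bucket/",
    table_level=2,
)
print(request["Configuration"])
```

Calling `create_crawler(request)` would register the crawler. Pointing it at the bucket root with level 2 is meant to give one table per customer folder, with the year= and month= folders below it treated as partitions.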

More here

Upvotes: 4

Kishore Bharathy

Reputation: 451

Glue's natural tendency, when pointed at the parent folder, is to add similar schemas to the same table if they match by more than 70% (assuming that, in your case, customer_1 and customer_2 have the same schema). Keeping them in individual folders might then create partitions based on the folder names.

Upvotes: 2
