K_inverse

Reputation: 379

Glue Crawler Skips a Particular S3 Folder

My S3 bucket is organised with the following hierarchy, storing one parquet file per date: <folder-name>/year=<yyyy>/month=<mm>/day=<dd>/<filename>.parquet
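To make the layout concrete, here is a minimal sketch of how the key prefix for one date would be composed. It assumes zero-padded month and day values, as the <mm>/<dd> placeholders suggest; the function name and the folder name "my-folder" are hypothetical.

```python
from datetime import date


def partition_prefix(folder: str, day: date) -> str:
    """Build the Hive-style partition prefix for a single date.

    Assumes the layout <folder-name>/year=<yyyy>/month=<mm>/day=<dd>/
    with zero-padded month and day.
    """
    return (
        f"{folder.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


# "my-folder" is a placeholder for <folder-name>.
print(partition_prefix("my-folder", date(2021, 3, 7)))
# prints: my-folder/year=2021/month=03/day=07/
```

Listing the objects under this prefix is a quick way to confirm that the parquet file for the affected date is really there before re-running the crawler.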

Manual Fix

For one particular date (i.e. a single parquet file), I fixed the data manually.

PS: At one point I seem to have deleted the parquet file on S3, leaving an empty sub-folder.

Then I re-ran the Glue crawler, pointing it at <folder-name>/. Unfortunately, the data for this particular date is missing from the Athena table.

After the crawler finished running, the notification was as follows:

Crawler <my-table-name> completed and made the following changes: 0 tables created, 0 tables updated. See the tables created in database <my-database-name>.

Is there anything I have misconfigured in my Glue crawler? Thanks.

Glue Crawler Config

Crawler Log in CloudWatch

BENCHMARK : Running Start Crawl for Crawler <my-table-name>
BENCHMARK : Classification complete, writing results to database <my-database-name>
INFO : Crawler configured with Configuration
{
    "Version": 1,
    "CrawlerOutput": {
        "Partitions": {
            "AddOrUpdateBehavior": "InheritFromTable"
        }
    },
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
}
 and SchemaChangePolicy 
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE"
}
. Note that values in the Configuration override values in the SchemaChangePolicy for S3 Targets.

BENCHMARK : Finished writing to Catalog
BENCHMARK : Crawler has finished running and is in state READY

Upvotes: 0

Views: 1267

Answers (2)

clene

Reputation: 1

I had the same problem. Check the inline policy of your IAM role. Where you specify the bucket, you should have something like this:

"Resource": [
    "arn:aws:s3:::bucket/object*"
]

When the crawler didn't work, I instead had the following:

"Resource": [
    "arn:aws:s3:::bucket/object"
]
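The difference between the two resources can be illustrated with a rough sketch. It approximates the `*` wildcard in an IAM resource ARN with Python's `fnmatchcase`; this is only an illustration, not how AWS actually evaluates policies, and the object key below is a made-up example.

```python
from fnmatch import fnmatchcase


def resource_matches(pattern: str, arn: str) -> bool:
    """Approximate IAM-style '*' matching of a resource pattern against an ARN.

    Assumption: fnmatchcase stands in for IAM wildcard matching here.
    """
    return fnmatchcase(arn, pattern)


# Hypothetical object stored under the "object" prefix of the bucket.
key_arn = "arn:aws:s3:::bucket/object/year=2021/month=03/day=07/part-0000.parquet"

print(resource_matches("arn:aws:s3:::bucket/object*", key_arn))  # True
print(resource_matches("arn:aws:s3:::bucket/object", key_arn))   # False
```

Without the trailing `*`, the resource only names the single key `object`, so nothing stored beneath that prefix is covered by the policy.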

Upvotes: 0

Gaurav Wasan

Reputation: 86

Assuming you are using the preconfigured "AWSGlueServiceRole" IAM role: if you are reading from or writing to S3 buckets, the bucket name should have the aws-glue* prefix for Glue to be able to access it. You can also try adding the aws-glue prefix to the names of the folders.

Upvotes: 0
