Reputation: 379
My S3 bucket is organised with the following hierarchy, storing parquet files: <folder-name>/year=<yyyy>/month=<mm>/day=<dd>/<filename>.parquet
For one particular date (i.e. a single parquet file), I made some manual fixes.
PS: I seem to have deleted the parquet file on S3 at one point, which left an empty sub-folder.
Then I re-ran the Glue crawler, pointing it at <folder-name>/. Unfortunately, the data for this particular date is missing from the Athena table.
After the crawler finished running, the notification was as follows:
Crawler <my-table-name> completed and made the following changes: 0 tables created, 0 tables updated. See the tables created in database <my-database-name>.
Is there anything I have misconfigured in my Glue crawler? Thanks
BENCHMARK : Running Start Crawl for Crawler <my-table-name>
BENCHMARK : Classification complete, writing results to database <my-database-name>
INFO : Crawler configured with Configuration
{
    "Version": 1,
    "CrawlerOutput": {
        "Partitions": {
            "AddOrUpdateBehavior": "InheritFromTable"
        }
    },
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
}
and SchemaChangePolicy
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE"
}. Note that values in the Configuration override values in the SchemaChangePolicy for S3 Targets.
BENCHMARK : Finished writing to Catalog
BENCHMARK : Crawler has finished running and is in state READY
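To double-check which dates actually have a data file in S3 (as opposed to an empty sub-folder), a small helper along these lines could parse the partition values out of the object keys. This is only a minimal sketch: the function name `parse_partitions` and the sample keys are hypothetical, and in practice the keys would come from an S3 listing of <folder-name>/.

```python
from typing import Dict, Optional


def parse_partitions(key: str) -> Optional[Dict[str, str]]:
    """Return the Hive-style partition values found in an S3 key.

    Returns None when the key is not a parquet data file, for example an
    empty "folder" marker whose key ends in '/'.
    """
    if key.endswith("/") or not key.endswith(".parquet"):
        return None
    segments = key.split("/")[:-1]  # drop the file name
    parts = dict(seg.split("=", 1) for seg in segments if "=" in seg)
    return parts or None


# Hypothetical keys: one real data file and one empty sub-folder marker.
sample_keys = [
    "events/year=2020/month=01/day=05/part-0.parquet",
    "events/year=2020/month=01/day=06/",
]

# Collect the dates that are backed by an actual parquet file.
dates_with_data = set()
for key in sample_keys:
    parts = parse_partitions(key)
    if parts is not None:
        dates_with_data.add((parts["year"], parts["month"], parts["day"]))

print(sorted(dates_with_data))
```

Comparing this set against the partitions listed for the table in Athena would show whether the missing date has a file in S3 at all.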
Upvotes: 0
Views: 1267
Reputation: 1
I had the same problem. Check the inline policy of your IAM role. When you specify the bucket, the resource should look something like this (note the trailing wildcard):
"Resource": [
    "arn:aws:s3:::bucket/object*"
]
When the crawler didn't work, I had the following instead:
"Resource": [
    "arn:aws:s3:::bucket/object"
]
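To see why the trailing `*` matters, here is a minimal sketch that mimics IAM-style wildcard matching on resource ARNs, where `*` stands for any run of characters and `?` for a single character. The helper names and the sample object ARN are made up for illustration, and this is not how IAM evaluates policies internally; it only imitates the wildcard rule.

```python
import re


def iam_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate an IAM-style wildcard pattern into a compiled regex.

    '*' matches any run of characters (including '/'), '?' matches one.
    Every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def resource_matches(pattern: str, arn: str) -> bool:
    """True when the whole ARN is covered by the pattern."""
    return iam_pattern_to_regex(pattern).fullmatch(arn) is not None


# Hypothetical object stored under a partitioned layout.
object_arn = "arn:aws:s3:::bucket/object/year=2020/month=01/day=05/data.parquet"

# Without the wildcard, only the exact string "bucket/object" is covered.
print(resource_matches("arn:aws:s3:::bucket/object", object_arn))
# With the wildcard, every key that starts with "object" is covered.
print(resource_matches("arn:aws:s3:::bucket/object*", object_arn))
```

Note that `bucket/object*` also covers sibling prefixes such as `object-old/`; a pattern like `bucket/object/*` would be narrower if only keys under that folder should be readable.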
Upvotes: 0
Reputation: 86
If you are reading from or writing to S3 buckets and you are using the preconfigured "AWSGlueServiceRole" IAM role, the bucket name should have the aws-glue* prefix for Glue to be able to access the bucket. You can try adding the aws-glue prefix to the names of the folders.
Upvotes: 0