Reputation: 379

Glue Crawler Skips a Particular S3 Folder

My S3 bucket stores parquet files under this hierarchy: <folder-name>/year=<yyyy>/month=<mm>/day=<dd>/<filename>.parquet
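To make the layout concrete, here is a minimal sketch that splits an object key of this shape into its partition values. The function name, the regular expression and the sample key are illustrative assumptions, not something taken from the bucket itself.

```python
import re

# Hive-style layout from the question:
# <folder-name>/year=<yyyy>/month=<mm>/day=<dd>/<filename>.parquet
PARTITION_KEY = re.compile(
    r"^(?P<folder>.+)/year=(?P<year>\d{4})/month=(?P<month>\d{2})"
    r"/day=(?P<day>\d{2})/(?P<filename>[^/]+\.parquet)$"
)


def parse_partition_key(key):
    """Split an S3 object key into folder, partition values and filename."""
    match = PARTITION_KEY.match(key)
    if match is None:
        raise ValueError(f"key does not follow the partition layout: {key!r}")
    return match.groupdict()


# Hypothetical key, used only for illustration.
parts = parse_partition_key("sales/year=2021/month=03/day=05/data.parquet")
```

A key that fails to parse (for example, a file uploaded one level too high or with an unpadded month) would be a first hint that the crawler cannot map it to a partition.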

Manual Fix

For one particular date (i.e. a single parquet file), I made a manual fix:

Downloaded the parquet file and read it into a pandas DataFrame
Updated some values, leaving the columns unchanged
Saved the pandas DataFrame back to a parquet file with the same filename
Uploaded it back to the same S3 bucket sub-folder
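The steps above can be sketched roughly as follows. This is only an illustration under assumptions: `s3` is taken to be a boto3 S3 client, `fix` is a hypothetical callable that edits the DataFrame, and `partition_key` is a helper invented here. The dtype check is an extra guard, not part of the original procedure.

```python
import tempfile
from pathlib import Path


def partition_key(folder, year, month, day, filename):
    """Build the object key for one daily partition file."""
    return f"{folder}/year={year:04d}/month={month:02d}/day={day:02d}/{filename}"


def fix_parquet_in_place(s3, bucket, key, fix):
    """Download one parquet object, apply `fix`, and upload it under the same key."""
    # Imported lazily so the key helper above works without pandas installed;
    # reading parquet also needs pyarrow or fastparquet.
    import pandas as pd

    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / Path(key).name
        s3.download_file(bucket, key, str(local))

        before = pd.read_parquet(local)
        after = fix(before.copy())

        # Editing values can silently change a dtype (e.g. int64 -> float64
        # once a NaN appears), so compare column names and dtypes first.
        if not after.dtypes.equals(before.dtypes):
            raise ValueError("fix changed the column names or dtypes")

        after.to_parquet(local)
        s3.upload_file(str(local), bucket, key)


# Hypothetical date and filename, for illustration only.
key = partition_key("<folder-name>", 2021, 3, 5, "data.parquet")
```

With a real client this would be called as `fix_parquet_in_place(boto3.client("s3"), "<bucket>", key, my_fix)`, where `my_fix` is whatever value correction is needed.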

PS: At one point I seem to have deleted the parquet file on S3, which left the sub-folder empty.

Then I re-ran the Glue crawler, pointed at <folder-name>/. Unfortunately, the data for this particular date is missing from the Athena table.
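One way to narrow this down is to check separately whether the file is really in S3 and whether the partition is registered in the Glue Data Catalog. The sketch below assumes `s3` and `glue` are boto3 clients created elsewhere; the two function names are made up for this example.

```python
def keys_under_prefix(s3, bucket, prefix):
    """Return the object keys S3 lists under a partition prefix (first page only)."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when nothing matches the prefix.
    return [obj["Key"] for obj in response.get("Contents", [])]


def partition_in_catalog(glue, database, table, values):
    """Return True if the Data Catalog has a partition with these values."""
    try:
        glue.get_partition(
            DatabaseName=database,
            TableName=table,
            PartitionValues=values,  # e.g. ["2021", "03", "05"] for year/month/day
        )
    except glue.exceptions.EntityNotFoundException:
        return False
    return True


# Usage with real clients (not executed here):
#   import boto3
#   keys_under_prefix(boto3.client("s3"), "<bucket>",
#                     "<folder-name>/year=2021/month=03/day=05/")
#   partition_in_catalog(boto3.client("glue"), "<my-database-name>",
#                        "<my-table-name>", ["2021", "03", "05"])
```

If the key is listed but the partition is not in the catalog, the problem is on the crawler side; if the key is not listed, the upload went somewhere else.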

After the crawler finished running, the notification was as follows:

Crawler <my-table-name> completed and made the following changes: 0 tables created, 0 tables updated. See the tables created in database <my-database-name>.

Is there anything I have misconfigured in my Glue crawler? Thanks.

Glue Crawler Config

Schema updates in the data store: Update the table definition in the data catalog.
Inherit schema from table: Update all new and existing partitions with metadata from the table.
Object deletion in the data store: Delete tables and partitions from the data catalog.

Crawler Log in CloudWatch

BENCHMARK : Running Start Crawl for Crawler <my-table-name>
BENCHMARK : Classification complete, writing results to database <my-database-name>
INFO : Crawler configured with Configuration
{
    "Version": 1,
    "CrawlerOutput": {
        "Partitions": {
            "AddOrUpdateBehavior": "InheritFromTable"
        }
    },
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
}
 and SchemaChangePolicy 
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE"
}
. Note that values in the Configuration override values in the SchemaChangePolicy for S3 Targets.

BENCHMARK : Finished writing to Catalog
BENCHMARK : Crawler has finished running and is in state READY
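The same settings can also be read programmatically rather than from the log. The sketch below assumes the input dict has the shape of the "Crawler" entry returned by boto3's `glue.get_crawler`, where "Configuration" is a JSON string and "SchemaChangePolicy" is a dict; the function name and the output keys are invented for this example.

```python
import json


def summarize_crawler_policies(crawler):
    """Collect the settings that govern partition updates and deletions."""
    config = json.loads(crawler.get("Configuration") or "{}")
    policy = crawler.get("SchemaChangePolicy", {})
    partitions = config.get("CrawlerOutput", {}).get("Partitions", {})
    return {
        "partition_behavior": partitions.get("AddOrUpdateBehavior"),
        "table_grouping": config.get("Grouping", {}).get("TableGroupingPolicy"),
        "update_behavior": policy.get("UpdateBehavior"),
        "delete_behavior": policy.get("DeleteBehavior"),
    }


# Values copied from the CloudWatch log above.
logged = {
    "Configuration": json.dumps({
        "Version": 1,
        "CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}},
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DELETE_FROM_DATABASE",
    },
}
summary = summarize_crawler_policies(logged)
```

With a real client the input would come from `boto3.client("glue").get_crawler(Name="<my-table-name>")["Crawler"]`.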

Upvotes: 0

Answers (2)

clene

Reputation: 1

I had the same problem. Check the inline policy of your IAM role. When you specify the bucket, you should have something like this:

"Resource": [
    "arn:aws:s3:::bucket/object*"
]

When the crawler didn't work, I instead had the following:

"Resource": [
    "arn:aws:s3:::bucket/object"
]
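A rough way to see the difference between the two resources is to match them against the ARN of an object inside a partition folder. The sketch below uses Python's `fnmatchcase` as a stand-in for IAM's `*` wildcard. That is only an approximation (fnmatch also treats `[...]` specially, which IAM does not), and the object key is a hypothetical example.

```python
from fnmatch import fnmatchcase


def resource_matches(pattern, arn):
    """Approximate IAM resource matching, where `*` matches any run of characters."""
    return fnmatchcase(arn, pattern)


# Hypothetical object under the "object" folder from the policy above.
object_arn = "arn:aws:s3:::bucket/object/year=2021/month=03/day=05/data.parquet"

without_wildcard = resource_matches("arn:aws:s3:::bucket/object", object_arn)
with_wildcard = resource_matches("arn:aws:s3:::bucket/object*", object_arn)

print(without_wildcard)  # False: the pattern names exactly one key
print(with_wildcard)     # True: the trailing * covers everything under the prefix
```

Without the trailing wildcard the statement covers only a key literally named "object", so the crawler's role would not be allowed to read the files beneath it.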

Upvotes: 0

Gaurav Wasan

Reputation: 86

If you are reading from or writing to S3 buckets with the preconfigured “AWSGlueServiceRole” IAM role, the bucket name should have the aws-glue* prefix for Glue to be able to access it. You can try adding the aws-glue prefix to the names of the folders.

Upvotes: 0
