Reputation: 1215
As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)?
At the moment, when I run the crawler over this data and then query it in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'.
My use case is:
If I were to manually write a schema I could do this fine as there would just be one table schema, and keys which are missing in the JSON file would be treated as Nulls.
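To make the behaviour I'm after concrete, here is a minimal TypeScript sketch of the semantics of a single table schema applied to records that carry different subsets of keys. The column names and sample records are hypothetical, and this only illustrates the desired outcome, not how Athena or Glue work internally.

```typescript
// Hypothetical column list standing in for the single table schema.
const tableColumns = ["id", "name", "price"] as const;

type Column = (typeof tableColumns)[number];
type Row = Record<Column, unknown>;

// Project one JSON record onto the table schema: keys the record
// lacks become null, keys outside the schema are ignored.
function projectOntoSchema(record: Record<string, unknown>): Row {
  const row = {} as Row;
  for (const column of tableColumns) {
    row[column] = column in record ? record[column] : null;
  }
  return row;
}

// Two records carrying different subsets of the columns.
const fullRecord = projectOntoSchema({ id: 1, name: "widget", price: 9.5 });
const sparseRecord = projectOntoSchema({ id: 2 });
```

Here `sparseRecord` ends up with `name` and `price` set to null, which is what I would expect each partition to do when it only contains some of the table's columns.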
Thanks in advance!
Upvotes: 43
Views: 19407
Reputation: 11487
If you want to resolve this issue with CDK code, here's an example:
// Assumes CDK v2; scope, name, role and service come from the surrounding construct.
import { CfnCrawler } from 'aws-cdk-lib/aws-glue';

const crawler = new CfnCrawler(scope, name, {
  name: name,
  description: "Glue crawler to fetch CloudWatch metrics data",
  role: role.roleArn,
  targets: { s3Targets: [{ path: 's3://' + service.bucket + '/' }] },
  schedule: { scheduleExpression: 'cron(0 * * * ? *)' },
  databaseName: this.databaseName,
  recrawlPolicy: { recrawlBehavior: 'CRAWL_NEW_FOLDERS_ONLY' },
  // Prevent the crawler from changing an existing schema
  // https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent
  configuration: '{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }',
  schemaChangePolicy: { deleteBehavior: 'LOG', updateBehavior: 'LOG' },
});
When you configure the crawler using the API, set the following parameters:
- Set the UpdateBehavior field in the SchemaChangePolicy structure to LOG.
- Set the Configuration field to a string representation of the following JSON object: { "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }
Reference: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent
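For anyone calling the API directly rather than going through CDK, the two settings above can be assembled roughly as follows. This is a sketch, not a complete client: the crawler name `my_crawler` is a placeholder, and the object is only shaped like the UpdateCrawler request, so check the field names against the SDK you use.

```typescript
// JSON object from the crawler configuration docs: partitions inherit
// their schema from the table instead of keeping their own.
const crawlerOutputConfig = {
  Version: 1.0,
  CrawlerOutput: {
    Partitions: { AddOrUpdateBehavior: "InheritFromTable" },
  },
};

// Request body shaped like the Glue UpdateCrawler API input.
// "my_crawler" is a placeholder name, not taken from the answer above.
const updateCrawlerInput = {
  Name: "my_crawler",
  SchemaChangePolicy: { UpdateBehavior: "LOG" },
  // The API expects Configuration as a string, not a nested object.
  // JSON.stringify writes the version number 1.0 as 1.
  Configuration: JSON.stringify(crawlerOutputConfig),
};
```

An object like `updateCrawlerInput` could then be handed to the UpdateCrawler operation, for example through `UpdateCrawlerCommand` in `@aws-sdk/client-glue`.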
Upvotes: 0
Reputation: 71
Despite selecting "Update all new and existing partitions with metadata from the table" in the crawler's configuration, it still occasionally failed to set the expected parameters for all partitions (specifically, jsonPath wasn't inherited from the table's properties in my case).
As suggested in https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html, it helped "to drop the partition that is causing the error and recreate it".
After I dropped the problematic partitions, the Glue crawler re-created them correctly on its next run.
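If there are several partitions to drop, the Athena statements can be generated rather than typed by hand. Below is a minimal TypeScript sketch that builds an `ALTER TABLE ... DROP PARTITION` statement; the table name and partition keys are hypothetical, and it assumes string-typed partition values that contain no single quotes.

```typescript
// Build an Athena DDL statement that drops one partition.
// Table name and partition keys used below are hypothetical placeholders.
function dropPartitionStatement(
  table: string,
  partition: Record<string, string>,
): string {
  const spec = Object.entries(partition)
    .map(([key, value]) => {
      // Keep the sketch simple: refuse values that would need escaping.
      if (value.includes("'")) {
        throw new Error(`Unsupported quote in partition value: ${value}`);
      }
      return `${key} = '${value}'`;
    })
    .join(", ");
  return `ALTER TABLE ${table} DROP IF EXISTS PARTITION (${spec})`;
}

const statement = dropPartitionStatement("my_table", { year: "2019", month: "07" });
// statement is:
// ALTER TABLE my_table DROP IF EXISTS PARTITION (year = '2019', month = '07')
```

Run the resulting statement in Athena, then let the crawler run again so it re-creates the partition.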
Upvotes: 0
Reputation: 851
It also fixed my issue! If somebody needs to provision a crawler with this configuration using Terraform, here is how I did it:
resource "aws_glue_crawler" "crawler-s3-rawdata" {
  database_name = "my_glue_database"
  name          = "my_crawler"
  # Placeholder: put the IAM role's name or ARN here
  role          = "my_iam_role.arn"

  # Make partitions inherit their schema from the table
  configuration = <<EOF
{
  "Version": 1.0,
  "CrawlerOutput": {
    "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
  }
}
EOF

  s3_target {
    path = "s3://mybucket"
  }
}
Upvotes: 5
Reputation: 816
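I had the same issue and solved it by configuring the crawler to update table metadata for preexisting partitions (the "Update all new and existing partitions with metadata from the table" option in the crawler's configuration).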
Upvotes: 70
Reputation: 345
This helped me. Posting the image for others in case the link is lost
Upvotes: 5