AWS Glue not detecting partition (created by different method Athena vs Glue)

I have Parquet files in S3 that were created by different sources but share the same schema. One set was created with an Athena CTAS query; the other was created with AWS Glue/Spark.

The files created by Glue look like:

The Athena CTAS ones look like:

I tried copying the files from the missing partitions into a separate folder and running a Glue crawler on it, and the crawler detects them there. But it does not seem to detect these partitions when everything is put together. Why is that? Do I need to process all the data with a single method for this to work?

Upvotes: 1

Answers (2)

Jiew Meng

Reputation: 88337

OK, I found the problem. There were two main issues:

Athena outputs bigint while Spark outputs int
Some column names differ in case, e.g. countryname vs countryName

One useful tip: either call printSchema on each partition and compare the output with diff, or check the table's partitions in the AWS Glue Data Catalog and look for differences between them.
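As a rough illustration of that comparison, here is a minimal sketch that diffs two schemas and flags exactly the two kinds of mismatch I hit. It assumes each schema is a list of (column name, type name) pairs, which is the shape PySpark's df.dtypes returns; the function name diff_schemas and the sample column id are made up for the example.

```python
def diff_schemas(left, right):
    """Compare two schemas given as lists of (column_name, type_name) pairs.

    Columns are matched case-insensitively so that a pure case difference
    is reported as such instead of as two unrelated columns.
    """
    left_by_key = {name.lower(): (name, dtype) for name, dtype in left}
    right_by_key = {name.lower(): (name, dtype) for name, dtype in right}

    problems = []
    for key in sorted(left_by_key.keys() | right_by_key.keys()):
        if key not in right_by_key:
            problems.append(f"{left_by_key[key][0]}: only in left")
        elif key not in left_by_key:
            problems.append(f"{right_by_key[key][0]}: only in right")
        else:
            lname, ltype = left_by_key[key]
            rname, rtype = right_by_key[key]
            if lname != rname:
                problems.append(f"{lname} vs {rname}: name case differs")
            if ltype != rtype:
                problems.append(f"{lname}: type {ltype} vs {rtype}")
    return problems


# Hypothetical schemas; with PySpark these pairs would come from df.dtypes
# after reading one partition written by each tool.
athena_schema = [("id", "bigint"), ("countryname", "string")]
spark_schema = [("id", "int"), ("countryName", "string")]

for line in diff_schemas(athena_schema, spark_schema):
    print(line)
```

An empty result means the two partitions agree on column names, name case, and types.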

Upvotes: 1

Ryan

Reputation: 299

If you have added data to a new partition, Glue should detect it as long as the schema matches.

You could try doing it manually with Athena and see if that works. Hopefully it will at least give you a helpful error.

ALTER TABLE orders ADD
  PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
  PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';

source: https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html

You could also try loading and printing the schema for both partitions to see if something is off.
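If loading the data is inconvenient, the columns that the Glue Data Catalog has recorded for each partition can be compared instead. The sketch below groups partitions by their column list, assuming the input has the shape of the Partitions list returned by boto3's Glue get_partitions call (Values plus StorageDescriptor.Columns with Name and Type). The function name and the sample data are hypothetical, and fetching the real list is left out.

```python
from collections import defaultdict


def group_partitions_by_schema(partitions):
    """Group partition values by the (name, type) signature of their columns.

    `partitions` is assumed to look like the "Partitions" list in a Glue
    get_partitions response.
    """
    groups = defaultdict(list)
    for part in partitions:
        columns = part["StorageDescriptor"]["Columns"]
        signature = tuple((col["Name"], col["Type"]) for col in columns)
        groups[signature].append(part["Values"])
    return dict(groups)


# Hypothetical catalog entries standing in for a real get_partitions result.
sample = [
    {"Values": ["2016-05-14"],
     "StorageDescriptor": {"Columns": [{"Name": "id", "Type": "bigint"}]}},
    {"Values": ["2016-05-15"],
     "StorageDescriptor": {"Columns": [{"Name": "id", "Type": "int"}]}},
    {"Values": ["2016-05-16"],
     "StorageDescriptor": {"Columns": [{"Name": "id", "Type": "bigint"}]}},
]

for signature, values in group_partitions_by_schema(sample).items():
    print(signature, values)
```

More than one group in the output means some partitions were registered with a different schema from the rest.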

Without more specifics, e.g. examples of how you are actually partitioning, I don't think I can help much more.

You should try to come up with a more reproducible example.

Upvotes: 2
