Reputation: 88189
I have Parquet files in S3 that were created by different sources but share the same schema. One set was created using Athena CTAS, the other using AWS Glue/Spark.
The files created by Glue look like:
The Athena CTAS ones look like:
I tried copying the files from the missing partitions into a separate folder and running a Glue crawler on it, and Glue detects them there. But the crawler does not seem to detect these partitions when everything is put together. Why is that? Do I need to process all the data using a single method for this to work?
Upvotes: 1
Views: 4720
Reputation: 88189
OK, I found the problem. There were two main issues.
One useful tip: either call printSchema on each partition and compare the output using diff, or check the table's partitions in the AWS Glue Data Catalog and look at the differences there.
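The schema comparison could be sketched roughly like this in Python. This is only an illustration, not what I actually ran: it assumes you have already saved the text that printSchema prints for each partition, and the helper name `diff_schemas` and the two sample schemas (column names and types) are made up for the example.

```python
import difflib
from typing import List


def diff_schemas(schema_a: str, schema_b: str,
                 name_a: str = "partition_a",
                 name_b: str = "partition_b") -> List[str]:
    """Return unified-diff lines between two printSchema() text dumps.

    An empty list means the two dumps are textually identical.
    """
    return list(difflib.unified_diff(
        schema_a.splitlines(),
        schema_b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        lineterm="",
    ))


# Hypothetical printSchema() output for one partition from each writer.
# The columns and types here are invented for illustration only.
glue_schema = """root
 |-- order_id: long (nullable = true)
 |-- amount: double (nullable = true)"""

ctas_schema = """root
 |-- order_id: integer (nullable = true)
 |-- amount: double (nullable = true)"""

for line in diff_schemas(glue_schema, ctas_schema, "glue", "athena_ctas"):
    print(line)
```

Any line starting with `-` or `+` in the output is a field that differs between the two partitions, which is where I would look first.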
Upvotes: 1
Reputation: 299
If you have added data to a new partition, Glue should detect it as long as the schema matches.
You could try doing it manually with Athena and see if that works. Hopefully it will at least give you a helpful error.
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
source: https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html
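If you have many partitions to add, you could generate that statement instead of typing it by hand. Below is a minimal Python sketch; the helper name `build_add_partitions` is hypothetical, and it assumes the partition values and locations are trusted strings (it does no escaping). It only builds the SQL text, you would still run it in Athena yourself.

```python
from typing import Dict, List, Tuple


def build_add_partitions(table: str,
                         partitions: List[Tuple[Dict[str, str], str]]) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement.

    Each entry in `partitions` is (partition_spec, s3_location), where
    partition_spec maps partition column names to string values.
    NOTE: values are inserted as-is, so only use trusted input.
    """
    clauses = []
    for spec, location in partitions:
        keys = ", ".join(f"{col} = '{val}'" for col, val in spec.items())
        clauses.append(f"PARTITION ({keys}) LOCATION '{location}'")
    return f"ALTER TABLE {table} ADD\n" + "\n".join(clauses) + ";"


# Same table, partitions and locations as in the statement above.
statement = build_add_partitions("orders", [
    ({"dt": "2016-05-14", "country": "IN"},
     "s3://mystorage/path/to/INDIA_14_May_2016"),
    ({"dt": "2016-05-15", "country": "IN"},
     "s3://mystorage/path/to/INDIA_15_May_2016"),
])
print(statement)
```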
You could also try loading and printing the schema for both partitions to see if something is off.
Without more specifics, e.g. examples of how you are actually partitioning the data, I don't think I can help much more.
You should try to come up with a more reproducible example.
Upvotes: 2