Reputation: 1604
Is there a way to create a table in Amazon Athena directly from parquet file based on avro schema? The schema is encoded into the file so its seems stupid that I need to actually create the DDL myself.
I saw this and also another duplication
but they are related directly to Hive, it wont work for Athena. Ideally I am looking for a way to do it programmatically without the need to define it at the console.
Upvotes: 16
Views: 10775
Reputation: 2223
This is now more-or-less possible using AWS Glue. Glue can crawl a bunch of different data sources, including Parquet files on S3. Discovered tables are added to the Glue data catalog and queryable from Athena. Depending on your needs, you could schedule a Glue crawler to run periodically, or you could define and run a crawler using the Glue API.
If you have many separate hunks of data that share a schema, you can also use a partitioned table to reduce the overhead of making new loads available to Athena. For example, I have some daily dumps that load into tables partitioned by date. As long as the schema doesn't change, all you then need to do is MSCK REPAIR TABLE
.
Upvotes: 8
Reputation: 1654
It doesn't seem to be possible with Athena as avro.schema.url
is not a supported property.
table property 'avro.schema.url' is not supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException...)
You can use avro.schema.literal
(you would have to copy the avro json schema to the query) but I still experienced problems querying the data afterwards.
Strange errors like:
SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause
Upvotes: 2