I'm trying to query data from S3 using AWS Athena, where the data is stored in Parquet format. Specifically, I am trying to create a nested schema that stores rows of a complex object, generated using the parquetjs library. Here is an example of how I am generating the data:
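const parquet = require('parquetjs');

// Nested schema: `body` is a repeated group with a single string field.
const schema = {
  id: {type: 'UTF8'},
  body: {
    repeated: true,
    fields: {
      text: {type: 'UTF8'},
    },
  },
};

// One example row matching the schema.
const obj = {
  id: '123',
  body: [
    {text: 'Hello'},
    {text: 'world!'},
  ],
};

// Runs inside an async function; fileName is the output path.
const parquetSchema = new parquet.ParquetSchema(schema);
const writer = await parquet.ParquetWriter.openFile(parquetSchema, fileName);
await writer.appendRow(obj);
await writer.close();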
In AWS Athena, I have created an external table with the following structure:
CREATE EXTERNAL TABLE `tabletest`(
  `id` string,
  `body` array<struct<text:string>>
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://xyz/parquet_test'
TBLPROPERTIES (
  'classification'='parquet',
  'transient_lastDdlTime'='1679107188')
However, when I try to query the data using SELECT * FROM tabletest, I get the following error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://xyz/parquet_test/d1710fd7dde563dc9e0348211825e726 (offset=0, length=272): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO
I'm not sure what is causing this error or how to resolve it. Any suggestions or insights would be greatly appreciated.
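For reference, here is a rough sketch of the Parquet message type I believe my schema definition produces. It does not call parquetjs at all: describeSchema is a hypothetical helper I wrote for illustration, the root name "root" is a placeholder, and it assumes that fields are required unless marked optional or repeated and that 'UTF8' is stored as an annotated binary column.

```javascript
// Hypothetical helper (not part of parquetjs): renders a parquetjs-style
// schema definition as Parquet message-type text, for inspection only.
// Assumption: a field is `required` unless `optional` or `repeated` is set.
function describeSchema(fields, indent = '  ') {
  const lines = [];
  for (const [name, def] of Object.entries(fields)) {
    const repetition = def.repeated
      ? 'repeated'
      : def.optional
        ? 'optional'
        : 'required';
    if (def.fields) {
      // Nested fields become a group with the same repetition.
      lines.push(`${indent}${repetition} group ${name} {`);
      lines.push(...describeSchema(def.fields, indent + '  '));
      lines.push(`${indent}}`);
    } else if (def.type === 'UTF8') {
      // Assumption: UTF8 strings are binary columns with a UTF8 annotation.
      lines.push(`${indent}${repetition} binary ${name} (UTF8);`);
    } else {
      lines.push(`${indent}${repetition} ${String(def.type).toLowerCase()} ${name};`);
    }
  }
  return lines;
}

// Same schema definition as in the question.
const schema = {
  id: {type: 'UTF8'},
  body: {
    repeated: true,
    fields: {
      text: {type: 'UTF8'},
    },
  },
};

// "root" is a placeholder name for the message.
const rendered = ['message root {', ...describeSchema(schema), '}'].join('\n');
console.log(rendered);
```

If that sketch is right, body is a bare repeated group with a single primitive field and no LIST annotation, whereas the Parquet format specification describes lists as a three-level structure (a LIST-annotated group containing a repeated group that holds the element). My unconfirmed guess is that this mismatch with array<struct<text:string>> is related to the cast error, but I have not been able to verify it.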