Caesar

Reputation: 9863

AWS Athena error when querying nested schema stored as Parquet format

I'm trying to query data in S3 with AWS Athena. The data is stored in Parquet format and uses a nested schema: each row holds a complex object, and the files are written with the parquetjs library. Here is an example of how I generate the data:

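const parquet = require('parquetjs');

// Nested schema: `body` is a repeated group with a single string field.
const schema = {
  id: {type: 'UTF8'},
  body: {
    repeated: true,
    fields: {
      text: {type: 'UTF8'},
    },
  },
};
const obj = {
  id: '123',
  body: [
    {text: 'Hello'},
    {text: 'world!'},
  ],
};

// `await` needs an async context, so the writer calls live in a function.
async function writeFile(fileName) {
  const parquetSchema = new parquet.ParquetSchema(schema);
  const writer = await parquet.ParquetWriter.openFile(parquetSchema, fileName);
  await writer.appendRow(obj);
  await writer.close(); // flushes the row group and writes the footer
}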

In AWS Athena, I have created an external table with the following structure:

CREATE EXTERNAL TABLE `tabletest`(
  `id` string,
  `body` array<struct<text:string>>
)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://xyz/parquet_test'
TBLPROPERTIES (
  'classification'='parquet', 
  'transient_lastDdlTime'='1679107188')

However, when I try to query the data using SELECT * FROM tabletest, I get the following error:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://xyz/parquet_test/d1710fd7dde563dc9e0348211825e726 (offset=0, length=272): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO

I'm not sure what is causing this error or how to resolve it. Any suggestions or insights would be greatly appreciated.
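
One thing that may be worth checking (an assumption, not confirmed against this table): the Parquet LIST convention stores an array as a three-level structure, an outer group containing a repeated group named list whose single field is named element, while the schema above writes body as a bare repeated group. A reader expecting the three-level layout could plausibly fail with a cast error like the one shown. Below is a minimal sketch of the same data reshaped into that layout. The names threeLevelSchema and toThreeLevelRow are hypothetical, and whether parquetjs output in this shape is accepted by Athena has not been tested here.

```javascript
// Sketch only: the list/element nesting follows the Parquet LIST convention;
// threeLevelSchema and toThreeLevelRow are illustrative names, not parquetjs APIs.

// Schema in parquetjs's plain-object style, using the three-level layout:
// body (group) -> list (repeated group) -> element (group) -> text.
const threeLevelSchema = {
  id: {type: 'UTF8'},
  body: {
    optional: true,
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              text: {type: 'UTF8'},
            },
          },
        },
      },
    },
  },
};

// Reshape a row from {body: [{text}, ...]}
// to {body: {list: [{element: {text}}, ...]}}.
function toThreeLevelRow(row) {
  return {
    ...row,
    body: {list: row.body.map((item) => ({element: item}))},
  };
}

const flatRow = {id: '123', body: [{text: 'Hello'}, {text: 'world!'}]};
const reshapedRow = toThreeLevelRow(flatRow);
console.log(JSON.stringify(reshapedRow, null, 2));
```

If this layout turns out to be what the reader expects, the Athena column type array<struct<text:string>> would stay the same; only the physical nesting in the file changes.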

Upvotes: 1

Views: 692

Answers (0)
