mixermt

Reputation: 383

Unable to query parquet data with nested fields in presto db

I have data, some of which includes nested columns (arrays of arrays of objects), saved as Parquet from Spark 2.2.

Now I'm trying to access this data externally with Presto, and I get the following exception whenever I try to access any nested column.

com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://name-node/parquet_path/part-00023-8d4f14b1-a3f1-4055-b931-04838701048d-c000.snappy.parquet (offset=0, length=108289): parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:220)
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:115)
    at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:157)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:93)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:239)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
    at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
    at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:56)
    at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:90)
    at com.facebook.presto.hive.parquet.ParquetPageSource.<init>(ParquetPageSource.java:109)

Interestingly, I'm able to query the other, non-nested columns without any issues.

The CREATE TABLE statement looks like this:

CREATE TABLE hive.tests.table_name (
not_nested_field_1 BIGINT,
not_nested_field_2 BIGINT,
not_nested_field_3 BOOLEAN,
not_nested_field_4 DOUBLE,
not_nested_field_5 ARRAY(VARCHAR),
not_nested_field_5 ARRAY(ROW(
    nested_level0_field1 BOOLEAN,
    nested_level0_field2 BIGINT,
    nested_level0_field3 BIGINT,
    nested_level0_field4 ARRAY(ROW(
        nested_level1_field1 BOOLEAN,
        nested_level1_field2 BIGINT,
        nested_level1_field3 VARCHAR,
        nested_level1_field4 ARRAY(ROW(
            nested_level2_field1 VARCHAR,
            nested_level2_field2 VARCHAR,
            nested_level2_field3 ARRAY(ROW(
                nested_level3_field1 VARCHAR,
                nested_level3_field2 VARCHAR)))),
        nested_level1_field5 ARRAY(ROW(
            nested_level2_field4 BIGINT,
            nested_level2_field5 BIGINT,
            nested_level2_field6 ARRAY(ROW(
                nested_level3_field3 VARCHAR,
                nested_level3_field4 VARCHAR)))))))))
WITH (
  format = 'PARQUET',
  external_location = 'hdfs://name-node/parquet_path/'
);

I'm using Presto version 0.208, with a local Hive metastore for creating external tables.

Any help would be appreciated :)

Upvotes: 3

Views: 7041

Answers (1)

mixermt

Reputation: 383

The issue was resolved by setting the hive.parquet.use-column-names=true property in catalog/hive.properties.

By default, Presto accesses Parquet columns by index (their position in the file), so this property has to be set explicitly to make it match columns by name, using the names declared in CREATE TABLE.
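To illustrate the difference between the two lookup modes, here is a minimal Python sketch. It is only a simplified model, not Presto's actual code: the column names (`id`, `label`, `items`), the "primitive"/"group" labels, and all the helper functions are hypothetical. It assumes the Parquet file has a column the table definition doesn't list, so positions no longer line up, which is one plausible way a nested table column ends up paired with a primitive file column.

```python
# Hypothetical, simplified model of a Parquet file schema: ordered (name, kind) pairs.
# "primitive" stands in for a scalar column, "group" for a nested one (array/row).
file_columns = [
    ("id", "primitive"),
    ("label", "primitive"),  # present in the file, absent from the table definition
    ("items", "group"),
]

# Hypothetical table definition; positions do not line up with the file's.
table_columns = [
    ("id", "primitive"),
    ("items", "group"),
]


def resolve_by_index(table, file):
    """Pair each table column with the file column at the same position."""
    return [(t, file[i]) for i, t in enumerate(table)]


def resolve_by_name(table, file):
    """Pair each table column with the file column that has the same name."""
    by_name = {name: (name, kind) for name, kind in file}
    return [(t, by_name[t[0]]) for t in table]


def mismatches(pairs):
    """Return the table column names whose expected kind differs from the file's."""
    return [t[0] for t, f in pairs if t[1] != f[1]]


print(mismatches(resolve_by_index(table_columns, file_columns)))  # ['items']
print(mismatches(resolve_by_name(table_columns, file_columns)))   # []
```

In this model, positional lookup pairs the nested `items` column with the primitive `label` column, while `id` still lines up. That would be consistent with only the nested columns failing. Lookup by name pairs every column correctly regardless of position.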

Upvotes: 9
