chendu

Reputation: 838

How to create a partitioned Trino table on S3 (with sub-folders)

My s3 location has the below structure

s3://bucketname/snapshot/db_collection/snapshot1/*.parquet
s3://bucketname/snapshot/db_collection/snapshot2/*.parquet
s3://bucketname/snapshot/db_collection/snapshot3/*.parquet

What I want is

  1. to be able to define the Trino table at the level s3://bucketname/snapshot/db_collection/, so that if I query for a row and it exists in 2 snapshots, I get 2 rows as output. I was not able to find out how to write a CREATE TABLE query for this use case (which is essentially a partitioning use case). Also note that the partition folders snapshotX are not in the <abc>=<efg> format.

  2. is there any tool or way to generate the table definition automatically from the Parquet file or its schema JSON file? I ask because my Parquet file has 150 columns, and each column is itself nested, etc. Writing the table definition by hand is not easy.

  3. I tried running an AWS Glue crawler to generate the table and Athena to query it, but when I run a SELECT query I get strange errors that scare me off, so I don't want to go down this path.

My existing table definition is as follows


create table trino.db_collection (
    col1 varchar,
    col2 varchar,
    col3 varchar
) with (
    external_location = 's3a://bucket/trino/db_collection/*',
    format = 'PARQUET'
)

My setup is AWS EMR 6.8.0 with trino-v388.

Upvotes: 2

Views: 14710

Answers (2)

tushararora19

Reputation: 51

The other answer is missing the ARRAY keyword in partitioned_by:

create table trino.db_collection (
    col3 varchar,
    -- partition columns must come last in the column list
    col1 varchar,
    col2 varchar
) with (
    external_location = 's3a://bucket/trino/db_collection/*',
    format = 'PARQUET',
    partitioned_by = ARRAY['col1','col2']
)

Upvotes: 3

Lior Friedler

Reputation: 11

Regarding partitions:

  1. As you mentioned, automatic partition discovery won't work because Trino looks for the Hive format col_name=value. As a best practice I would recommend running a one-time procedure to rename the keys; however, if this is not possible, you can still register partitions manually using the register_partition system procedure (see the sketch after the example table below). It's just tedious to maintain.

system.register_partition(schema_name, table_name, partition_columns, partition_values, location)

Please note you'll also need to enable this in the catalog properties file of your Trino installation.

From the docs (https://trino.io/docs/current/connector/hive.html#procedures):

Due to security reasons, the procedure is enabled only when hive.allow-register-partition-procedure is set to true.
  2. The partition column has to be the last one in your table schema, and the partitioned_by property has to be defined in the table properties.

So in your example:

create table trino.db_collection (
    col1 varchar,
    col2 varchar,
    col3 varchar,
    snapshot varchar
) with (
    external_location = 's3a://bucket/trino/db_collection/*',
    format = 'PARQUET',
    partitioned_by = ['snapshot']
)
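
Then, a minimal sketch of registering one snapshot folder as a partition (the catalog and schema names, hive and default, are placeholders for your setup, and this assumes hive.allow-register-partition-procedure=true is set in the catalog properties file, e.g. etc/catalog/hive.properties):

-- Register the existing snapshot1 folder as a partition of the table above.
CALL hive.system.register_partition(
    schema_name => 'default',
    table_name => 'db_collection',
    partition_columns => ARRAY['snapshot'],
    partition_values => ARRAY['snapshot1'],
    location => 's3a://bucket/trino/db_collection/snapshot1'
);

Repeat the call for each snapshotX folder. Once the partitions are registered, a single SELECT over the table returns matching rows from every snapshot, and the snapshot column tells you which folder each row came from.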

Regarding inferring the table schema:

  1. Schema inference is not supported in Trino, but it can be done with Spark or a Glue crawler. If you register the table in the Glue catalog, it can be read by Trino as well.
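
For example, a minimal sketch in Spark SQL (the table and location names are illustrative; on EMR this assumes the Glue Data Catalog is configured as the metastore for both Spark and Trino):

-- No column list is given, so Spark infers the full nested schema
-- from the Parquet files and stores the table in the metastore.
CREATE TABLE db_collection_inferred
USING parquet
LOCATION 's3://bucketname/snapshot/db_collection/snapshot1/';

Trino can then read the table through the hive connector, and SHOW CREATE TABLE in Trino gives you the inferred column list to copy into a partitioned definition like the one above.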

Can you share the errors you got when selecting?

Upvotes: 1
