Reputation: 838
My S3 location has the following structure:
s3://bucketname/snapshot/db_collection/snapshot1/*.parquet
s3://bucketname/snapshot/db_collection/snapshot2/*.parquet
s3://bucketname/snapshot/db_collection/snapshot3/*.parquet
What I want is to be able to define the Trino table at the level s3://bucketname/snapshot/db_collection/, so that if I query for a row and it exists in 2 snapshots, I get 2 rows as output. I was not able to find out how to write a CREATE TABLE query for this use case (which is essentially a partitioning use case). Also note that the partition folder snapshotX is not in the <abc>=<efg> format.
Is there any tool or way to generate the table definition automatically from the parquet file or its schema JSON file? I ask because my parquet file has 150 columns, and each column is again nested; writing the table definition by hand is not easy.
I tried running an AWS Glue crawler to generate the table and Athena to query it, but when I ran a SELECT query I got weird errors that scared me off, so I don't want to take that path.
My existing table definition is as follows:

create table trino.db_collection (
    col1 varchar,
    col2 varchar,
    col3 varchar
) with (
    external_location = 's3a://bucket/trino/db_collection/*',
    format = 'PARQUET'
)
My setup is AWS EMR 6.8.0 with Trino v388.
Upvotes: 2
Views: 14710
Reputation: 51
The other answer's partitioned_by is missing the ARRAY keyword. Corrected, and with the partition columns moved to the end of the column list (Trino requires partition keys to be the last columns, in the same order as partitioned_by):
create table trino.db_collection (
    col3 varchar,
    -- partition columns must come last, in the same order as partitioned_by
    col1 varchar,
    col2 varchar
) with (
    -- external_location should point at the directory itself; wildcards are not supported
    external_location = 's3a://bucket/trino/db_collection/',
    format = 'PARQUET',
    partitioned_by = ARRAY['col1','col2']
)
Upvotes: 3
Reputation: 11
Regarding partitions: the Hive connector expects partition directories in the col_name=value format. As a best practice I would recommend running a one-time job to rename the keys; however, if that is not possible, you can still register partitions manually using the register_partition system procedure. It's just tedious to maintain:

system.register_partition(schema_name, table_name, partition_columns, partition_values, location)
Please note you'll also need to edit your installation config and enable this in the catalog properties file.
From the docs (https://trino.io/docs/current/connector/hive.html#procedures):
Due to security reasons, the procedure is enabled only when hive.allow-register-partition-procedure is set to true.
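For example, add the following line to the Hive catalog properties file (the file name is an assumption; it is commonly etc/catalog/hive.properties):

hive.allow-register-partition-procedure=true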
So in your example:
create table trino.db_collection (
    col1 varchar,
    col2 varchar,
    col3 varchar,
    snapshot varchar
) with (
    external_location = 's3a://bucket/trino/db_collection/',
    format = 'PARQUET',
    partitioned_by = ['snapshot']
)
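With the table defined, each snapshot directory can then be registered as a partition. A minimal sketch, assuming the Hive catalog is named hive and the table lives in a schema named db (both names are assumptions):

-- register the snapshot1 directory as the partition snapshot='snapshot1'
CALL hive.system.register_partition(
    schema_name => 'db',
    table_name => 'db_collection',
    partition_columns => ARRAY['snapshot'],
    partition_values => ARRAY['snapshot1'],
    location => 's3a://bucketname/snapshot/db_collection/snapshot1/');

Repeat for snapshot2, snapshot3, and so on. Once the partitions are registered, a row that exists in two snapshots will come back twice, once per matching partition, which is the behaviour you asked for.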
Regarding inferring the table schema:
Can you share the errors you got when selecting?
Upvotes: 1