Alex
Alex

Reputation: 77

How to dynamically create table in Snowflake getting schema from parquet file which stored in AWS

Could you help me to load a couple of parquet files to Snowflake.

I've got about 250 parquet-files which stored in AWS stage.

250 files = 250 different tables.

I'd like to dynamically load them into Snowflake tables.

So, I need:

  1. Get schema from parquet file... I've read that I could get the schema from parquet file using parquet-tools (Apache).
  2. Create table using schema from the parquet file
  3. Load data from parquet-file to this table.

Could anyone help me how to do that? Does exist the most efficient way to realize it? (by using GUI Snowflake, for example). Can't find it.

Thanks.

Upvotes: 0

Views: 5238

Answers (2)

Secret squirrel
Secret squirrel

Reputation: 131

This process is automatable provided you can use SnowSQL, and ideally bash or Powershell scripting.
At a pinch you can generate the SQL in Excel using 1 row per file and 1 column per sql command, then for the last column concat all the previous sql command columns together with a newline character.

The steps are as follows

Obtain each filename.

Option A

  • Create a separate stage for each file
  • Load the file to the stage using PUT
  • Create a table using INFER schema
  • Copy INTO the created table from the stage using
  • Drop or Purge the stage.

Option B

  • Create a single stage
  • Load all the files using PUT & wildcard file names
  • Create a table using INFER schema and the exact filename and path
  • Drop or purge the stage.

I prefer option A as its cleaner and I have found Infer_schema tricky with filepaths.

Incidentally the example syntax for Create table using infer schema on a parquet file isnt listed as an example.

This is the correct syntax.

CREATE STAGE my_stage;

CREATE OR REPLACE FILE FORMAT my_parquet_format TYPE = parquet;

PUT 'file:///<filepath>/<filename or wildcard>.snappy.parquet' @my_stage;

CREATE TABLE mytable
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      FROM TABLE(
        INFER_SCHEMA(
          LOCATION=>'@my_stage',
          FILE_FORMAT=>'my_parquet_format'
        )
      ));

COPY INTO mytable FROM @my_stage FILE_FORMAT = (FORMAT_NAME= 'my_parquet_format') MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE;

Upvotes: 1

Anshul Thakur
Anshul Thakur

Reputation: 524

If the schema of the files is same you can put them in a single stage and use the Infer-Schema function. This will give you the schema of the parquet files. https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html

In case all files have different schema then I'm afraid you have to infer the schema on each file.

Upvotes: 1

Related Questions