krx

Reputation: 115

How to upload local files to Snowflake faster?

I need to upload files larger than 10 GB to Snowflake tables.

The current method I'm using is the Python Snowflake Connector:

# Create stage (a named internal stage, not a table)
query1 = "create or replace stage demo_stage file_format = (TYPE=CSV);"
execute_query(conn, query1)

# Upload file from local disk to the stage
query2 = "put file://file.csv @demo_stage auto_compress=true"
execute_query(conn, query2)

# Load the staged file into the final table
# (note the trailing space before the line break, otherwise the two
#  string fragments run together into invalid SQL)
query3 = "copy into demo from @demo_stage/file.csv.gz " \
         "file_format = (TYPE=CSV) on_error=continue;"
execute_query(conn, query3)

However, this method takes a lot of time for my files.

Is there any way to optimize it, or is there an alternative method?

Upvotes: 0

Views: 1408

Answers (2)

Following Snowflake guidelines, and similar to Lukasz's suggestion, you should split your ~10 GB file into chunks of 250-300 MB each (this is best practice) using third-party utilities. You can use tools like OpenRefine for this splitting.
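If you'd rather not install a separate utility, a plain-Python split works as well. Here is a minimal sketch (the input name data.csv, the output directory, and the chunk size are all illustrative) that repeats the header row in every chunk so each piece can be loaded independently:

import os

def split_csv(src_path, out_dir, chunk_bytes=250 * 1024 * 1024):
    # Split a large CSV into ~chunk_bytes pieces, repeating the header row
    os.makedirs(out_dir, exist_ok=True)
    with open(src_path, "r", encoding="utf-8") as src:
        header = src.readline()
        part, out, written = 0, None, 0
        for line in src:
            if out is None or written >= chunk_bytes:
                if out is not None:
                    out.close()
                part += 1
                out = open(os.path.join(out_dir, "file_part_%04d.csv" % part),
                           "w", encoding="utf-8")
                out.write(header)
                written = 0
            out.write(line)
            written += len(line)  # approximate: counts characters, not bytes
        if out is not None:
            out.close()

split_csv("data.csv", "chunks")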

After that, you can go ahead and PUT each file into your internal stage (same as your code above).
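Note that PUT accepts wildcards, so you can stage all the chunks with a single command and let the connector upload them concurrently (the chunk names below match the hypothetical split above):

# Stage every chunk in one PUT; files are uploaded in parallel
query2 = "put file://chunks/file_part_*.csv @demo_stage auto_compress=true parallel=8"
execute_query(conn, query2)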

PS: You should also consider using a multi-cluster warehouse for this loading activity.
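Keep in mind the warehouse only affects the COPY INTO step (PUT runs client-side), so scaling it up for the load is a one-line change; a sketch assuming a warehouse named load_wh:

# Temporarily scale up for the load (warehouse name is hypothetical)
execute_query(conn, "alter warehouse load_wh set warehouse_size = 'LARGE';")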

As far as alternatives go, other routes you can explore to upload local files into Snowflake faster are:

  • Prebuilt third-party modeling tools
  • Snowpipe, if you want to automate ingestion into Snowflake (see the sketch after this list).
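For the Snowpipe route, loads from a stage can be triggered programmatically with the snowflake-ingest Python package. A rough sketch, assuming a pipe named demo_pipe already exists over your stage and key-pair authentication is configured (all names and paths here are placeholders):

from snowflake.ingest import SimpleIngestManager, StagedFile

# Placeholder credentials; an unencrypted PEM private key is assumed
with open("rsa_key.p8") as f:
    private_key = f.read()

ingest_manager = SimpleIngestManager(
    account="myaccount",
    host="myaccount.snowflakecomputing.com",
    user="ingest_user",
    pipe="mydb.public.demo_pipe",
    private_key=private_key,
)

# Ask the pipe to load a file that was already PUT into its stage
resp = ingest_manager.ingest_files([StagedFile("file.csv.gz", None)])
print(resp)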

I actually work with a team that's building a prebuilt tool for easy loading into Snowflake, Datameer; feel free to check it out here if you wish: https://www.datameer.com/upload-csv-to-snowflake/

Upvotes: 0

Lukasz Szozda

Reputation: 175556

To improve upload performance, it is advisable to generate smaller CSV files.

The PUT command lets you define the PARALLEL option:

Specifies the number of threads to use for uploading files. The upload process separates batches of data files by size:

  • Small files (< 64 MB compressed or uncompressed) are staged in parallel as individual files.

  • Larger files are automatically split into chunks, staged concurrently, and reassembled in the target stage. A single thread can upload multiple chunks.

Increasing the number of threads can improve performance when uploading large files.

Supported values: Any integer value from 1 (no parallelism) to 99 (use 99 threads for uploading files).

Default: 4

# Upload file from local disk to the stage, using X threads
query2 = "put file://file.csv @demo_stage auto_compress=true parallel=X"
execute_query(conn, query2)
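For completeness, here is a self-contained sketch of the tuned upload with the connector (connection parameters are placeholders, and parallel=16 is just an example value; execute_query above is assumed to wrap cursor.execute):

import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount", user="myuser", password="...",
    warehouse="load_wh", database="mydb", schema="public",
)
with conn.cursor() as cur:
    # Multi-threaded upload of the local file to the stage
    cur.execute("put file://file.csv @demo_stage auto_compress=true parallel=16")
conn.close()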

Upvotes: 2
