Reputation: 115
I need to upload files larger than 10 GB to Snowflake tables.
The current method I'm using is the Python Snowflake Connector:
# Create stage
query1 = "create or replace stage demo_stage file_format = (TYPE = CSV);"
execute_query(conn, query1)
# Upload file from local machine to the stage
query2 = "put file://file.csv @demo_stage auto_compress=true"
execute_query(conn, query2)
# Copy file from the stage into the target table
query3 = "copy into demo from @demo_stage/file.csv.gz " \
         "file_format = (TYPE = CSV) on_error = continue;"
execute_query(conn, query3)
However, this method takes a lot of time for my files.
Is there any way to optimize it, or is there an alternative method?
Upvotes: 0
Views: 1408
Reputation: 1
Following Snowflake guidelines, and similar to Lukasz's suggestion, you should split your ~10 GB file into chunks of 250-300 MB each (this is the best practice) using a third-party utility. You can use a tool like OpenRefine for this splitting.
After that, you can go ahead and PUT each chunk into your internal stage (same as your code above); see the sketch below for the whole split-and-PUT flow.
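A minimal sketch of that flow, reusing the question's conn / execute_query helpers and file/stage names; the chunk size, the part-file naming, and the pattern-based COPY at the end are illustrative assumptions, not a definitive implementation:

# Sketch: split file.csv into ~250 MB parts (header repeated in each part),
# PUT every part into the stage, then load them all with a single COPY.
CHUNK_BYTES = 250 * 1024 * 1024  # ~250 MB target, per the guideline above

def split_csv(path, chunk_bytes=CHUNK_BYTES):
    part_paths = []
    with open(path, "r", encoding="utf-8") as src:
        header = src.readline()
        part, written = None, 0
        for line in src:
            if part is None or written >= chunk_bytes:
                if part:
                    part.close()
                # Hypothetical naming scheme: file.csv.part0.csv, file.csv.part1.csv, ...
                part_path = f"{path}.part{len(part_paths)}.csv"
                part = open(part_path, "w", encoding="utf-8")
                part.write(header)
                part_paths.append(part_path)
                written = 0
            part.write(line)
            written += len(line)  # approximate byte count for ASCII-ish CSVs
        if part:
            part.close()
    return part_paths

for part_path in split_csv("file.csv"):
    execute_query(conn, f"put file://{part_path} @demo_stage auto_compress=true")

# One COPY picks up all staged parts via a regex pattern
execute_query(conn, "copy into demo from @demo_stage "
                    "pattern = '.*file.csv.part.*' "
                    "file_format = (TYPE = CSV) on_error = continue;")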
PS: You should also consider using a multi-cluster warehouse for this loading activity, for example:
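A one-line sketch of that setting, assuming a warehouse named load_wh and illustrative cluster counts (multi-cluster mode requires Enterprise Edition or above):

# Hypothetical example: enable multi-cluster scaling on the loading warehouse.
# "load_wh" and the cluster counts are assumptions, not from the question.
execute_query(conn, "alter warehouse load_wh "
                    "set min_cluster_count = 1 max_cluster_count = 4;")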
As far as alternatives go, there are other routes you can explore to upload local files into Snowflake faster.
I actually work with the team behind Datameer, a prebuilt tool for easy loading into Snowflake; feel free to check it out here if you wish: https://www.datameer.com/upload-csv-to-snowflake/
Upvotes: 0
Reputation: 175556
In order to improve upload performance, it is advisable to generate smaller CSV files.
The PUT command also lets you define the PARALLEL option:
Specifies the number of threads to use for uploading files. The upload process separates batches of data files by size:
Small files (< 64 MB compressed or uncompressed) are staged in parallel as individual files.
Larger files are automatically split into chunks, staged concurrently, and reassembled in the target stage. A single thread can upload multiple chunks.
Increasing the number of threads can improve performance when uploading large files.
Supported values: Any integer value from 1 (no parallelism) to 99 (use 99 threads for uploading files).
Default: 4
# Upload file from local machine to the stage, using multiple threads
query2 = "put file://file.csv @demo_stage auto_compress=true parallel=X"
execute_query(conn, query2)
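For completeness, a minimal end-to-end sketch using snowflake-connector-python directly; the connection parameters and the parallel=16 value are illustrative assumptions, not values from the question:

import snowflake.connector

# Placeholder credentials -- substitute your own account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
try:
    cur = conn.cursor()
    cur.execute("create or replace stage demo_stage file_format = (TYPE = CSV);")
    # parallel=16 is an example value; tune between 1 and 99 as documented above
    cur.execute("put file://file.csv @demo_stage auto_compress=true parallel=16")
    cur.execute("copy into demo from @demo_stage "
                "file_format = (TYPE = CSV) on_error = continue;")
finally:
    conn.close()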
Upvotes: 2