Ahmad Mousavi

Reputation: 601

From S3 to Snowflake and performance

I have a huge number of events stored in S3 as small JSON files, and I need to ingest them into Snowflake using Snowpipe. Is there any performance concern around the number of requests being sent to Snowflake? Should I merge these tiny files into a bigger JSON file and then let Snowflake ingest that instead?

I know Snowflake can automatically detect changes on S3 and refresh its external tables, but should I let small files constantly trigger this process or not?

Upvotes: 0

Views: 825

Answers (2)

Chris

Reputation: 690

Yes, there is a limit on the number of API requests that can be made. This rate limit exists at the account level, not the user or pipe level, and each endpoint has its own bank of tokens.

loadHistoryScan:

Starts with 20,000 tokens
1 token consumed per call
5 tokens refilled for each file ingested
1 token refilled each minute


ingestFile:

Starts with an unknown number of tokens
1 token consumed per file
15,000 tokens refilled each minute

For the ingestFile endpoint, 1 token is used for each file in your API call; this is true whether you submit 5,000 calls with one file each or one call with 5,000 files.

It is important not to resubmit the same files, as you will consume a token for each resubmitted file, and the copy_history table function does not tell you whether a pipe has skipped a file. A batching sketch follows.
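To make the batching point concrete, here is a minimal sketch (not the asker's setup) using the snowflake-ingest Python SDK: one request carrying many staged files still pays one token per file, but avoids thousands of separate calls. The account, user, pipe, file keys, and private key are placeholders, and the local already_sent set is a hypothetical way to avoid resubmitting files, since copy_history will not tell you a file was skipped.

    from snowflake.ingest import SimpleIngestManager, StagedFile

    # Placeholder connection details; private_key_pem is assumed to be
    # a PEM-formatted key string loaded elsewhere.
    ingest_manager = SimpleIngestManager(
        account='myaccount',
        host='myaccount.snowflakecomputing.com',
        user='INGEST_USER',
        pipe='MYDB.MYSCHEMA.MYPIPE',
        private_key=private_key_pem,
    )

    # Track what you have already submitted yourself, since the pipe will
    # silently skip resubmitted files while still charging you the tokens.
    already_sent = set()
    paths = [f'events/2023/01/file_{i}.json' for i in range(5000)]  # hypothetical keys
    new_paths = [p for p in paths if p not in already_sent]

    # One call with 5,000 files costs the same 5,000 ingestFile tokens as
    # 5,000 single-file calls, but is only one HTTP request.
    response = ingest_manager.ingest_files([StagedFile(p, None) for p in new_paths])
    already_sent.update(new_paths)
    print(response['responseCode'])  # 'SUCCESS' when the request is accepted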


insertReport:

Starts with an unknown number of tokens
1 token consumed per call
Refill rate unknown


The maximum number of tokens an account can have at any time is 100,000; once you reach that point, refills stop until you start making API calls and consuming tokens again.
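As a toy illustration of how these banks behave (this is only a model of the numbers above, not anything you call against Snowflake), refills are capped at the account ceiling and consumption is what makes room for them again:

    # Toy token-bucket model of the limits described in this answer.
    class TokenBucket:
        CAP = 100_000  # account-wide ceiling; refills stop here

        def __init__(self, tokens, refill_per_minute):
            self.tokens = tokens
            self.refill_per_minute = refill_per_minute

        def tick_minute(self):
            # Refills never push the balance past the cap.
            self.tokens = min(self.CAP, self.tokens + self.refill_per_minute)

        def consume(self, n=1):
            if self.tokens < n:
                raise RuntimeError('throttled: token bank is empty')
            self.tokens -= n

    # loadHistoryScan starts at 20,000 tokens and refills 1 per minute
    # (plus 5 per ingested file, not modelled here).
    bucket = TokenBucket(tokens=20_000, refill_per_minute=1)
    bucket.consume()      # one call, one token
    bucket.tick_minute()  # one token back
    print(bucket.tokens)  # 20000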

Snowflake recommends file sizes for loading be between 10 MB and 100 MB compressed. If your requirements allow you the time to consolidate these small files into larger ones, then yes, I would recommend that approach.
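If you do consolidate, a rough sketch of the merge step might look like the following. It assumes boto3, hypothetical bucket and prefix names, and newline-delimited JSON output, and it cuts a new gzipped part once the compressed size lands inside the recommended range:

    import gzip
    import io

    import boto3

    s3 = boto3.client('s3')
    BUCKET = 'my-events-bucket'          # hypothetical bucket and prefixes
    SRC_PREFIX, DST_PREFIX = 'raw/', 'merged/'
    TARGET_BYTES = 50 * 1024 * 1024      # ~50 MB compressed, inside 10-100 MB

    def new_part():
        buf = io.BytesIO()
        return buf, gzip.GzipFile(fileobj=buf, mode='wb')

    part, files_in_part = 0, 0
    buf, gz = new_part()

    pages = s3.get_paginator('list_objects_v2').paginate(Bucket=BUCKET,
                                                         Prefix=SRC_PREFIX)
    for page in pages:
        for obj in page.get('Contents', []):
            body = s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'].read()
            gz.write(body.rstrip(b'\n') + b'\n')  # one JSON document per line
            files_in_part += 1
            if buf.tell() >= TARGET_BYTES:        # tell() sees compressed bytes
                gz.close()
                s3.put_object(Bucket=BUCKET,
                              Key=f'{DST_PREFIX}part-{part:05d}.ndjson.gz',
                              Body=buf.getvalue())
                part, files_in_part = part + 1, 0
                buf, gz = new_part()

    gz.close()
    if files_in_part:                             # flush the final partial part
        s3.put_object(Bucket=BUCKET,
                      Key=f'{DST_PREFIX}part-{part:05d}.ndjson.gz',
                      Body=buf.getvalue())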

Upvotes: 3

David Garrison

Reputation: 2880

Snowflake has some documentation that should answer this pretty well. In short: ideally your files are big, but not so big and/or complicated that they take more than a minute to process.

I have some Snowpipes that handle a lot of small files without much trouble, but you'll probably get at least slightly better performance out of larger files. In general, Snowflake is optimized for larger batches.

Another note is that Snowpipe pricing includes the number of files loaded, so you might also see some cost savings by merging tiny files together.
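For a sense of scale (the daily volumes here are made up, and the per-file charge should be checked against current Snowflake pricing; the docs list 0.06 credits per 1,000 files at the time of writing):

    # Back-of-envelope file-overhead comparison; volumes are hypothetical.
    PER_1000_FILES_CREDITS = 0.06   # published Snowpipe file charge; verify docs

    def file_overhead_credits(n_files):
        return n_files / 1_000 * PER_1000_FILES_CREDITS

    print(file_overhead_credits(1_000_000))  # tiny files: 60.0 credits/day
    print(file_overhead_credits(2_000))      # merged:      0.12 credits/day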

Upvotes: 2
