Reputation: 11
I am looking for the best way to automatically detect new files in a S3 bucket and then to load the data into a Snowflake table.
I know this can be achieved using Snowpipe and SNS, SQS notifications set up in AWS but I would like to have a self-contained solution within Snowflake which can be used for multiple data sources.
I want to have a table which is updated with the file names from a S3 bucket and then to load files which have not already been loaded from S3 into Snowflake.
The only way I have found to automatically detect new files from an external S3 stage in Snowflake so far is to use the code below and a task on a set schedule. This lists the file names and then uses result_scan to display the last query as a table.
list @STAGE_NAME;
set qid=last_query_id();
select "name" from table(result_scan($qid))
Does anyone know a better way to automatically detect new files in an external stage from Snowflake? Any help is much appreciated.
Upvotes: 1
Views: 6479
Reputation: 1
You've already found a viable solution, and your concern about the reliability of the last query id function is understandable. Procedures' sessions are isolated and so the last_query_id() function will be isolated to only the statements executed within that procedure. It might be unnecessary to use a procedure, but I personally like that they let you create reusable abstractions.
An alternative, if you don't like the approach you're using, would be to create a single table with a single VARIANT data column plus the stage metadata columns, maintained by a single giant pipe, and you could maintain a set of materialized views over that table, which would filter, convert variant fields to columns, and sanitize, as appropriate.
There are some benefits:
Notification Integrations enable snowflake to listen (and possibly notify in the future, roadmap-gods willing) to external messaging systems. At this moment only Azure is supported, so it won't work for your case, but keep an eye out over the next few months -- I think it's safe to speculate that we will see this feature grow to support AWS, and a more direct and concise manner for implementing your original solution will eventually become available.
Upvotes: 0
Reputation: 12756
Not necessarily better than the way you've already found, but there is an alternative approach to listing the files in an S3 bucket.
If you create an EXTERNAL TABLE over the data in S3, you can then use the METADATA$FILENAME property in a query. If you have a record of which files have already been loaded into Snowflake then you can compare and select the names of the new files and process them.
e.g.
ALTER EXTERNAL TABLE MYSCHEMA.MYEXTERNALTABLE REFRESH;
SELECT DISTINCT
METADATA$FILENAME as filename
FROM
MYSCHEMA.MYEXTERNALTABLE;
Upvotes: 0