user19771235
user19771235

Reputation: 11

Query Snowflake Named Internal Stage by Column NAME and not POSITION

My company is attempting to use Snowflake Named Internal Stages as a data lake to store vendor extracts.

There is a vendor that provides an extract that is 1000+ columns in a pipe delimited .dat file. This is a canned report that they extract. The column names WILL always remain the same. However, the column locations can change over time without warning.

Based on my research, a user can only query a file in a named internal stage using the following syntax:

--problematic because the order of the columns can change. select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.data.[.]dat.gz') t;

Is there anyway to use the column names instead?

E.g., Select t.first_name from @mystage1 (file_format => 'myformat', pattern=>'.data.[.]csv.gz') t;

I appreciate everyone's help and I do realize that this is an unusual requirement.

Upvotes: 1

Views: 762

Answers (1)

Felipe Hoffa
Felipe Hoffa

Reputation: 59375

You could read these files with a UDF. Parse the CSV inside the UDF with code aware of the headers. Then output either multiple columns or one variant.

For example, let's create a .CSV inside Snowflake we can play with later:

create or replace temporary stage my_int_stage
file_format = (type=csv compression=none);

copy into '@my_int_stage/fx3.csv'
from (
    select *
    from snowflake_sample_data.tpcds_sf100tcl.catalog_returns
    limit 200000
)
header=true
single=true
overwrite=true
max_file_size=40772160
;

list @my_int_stage
-- 34MB uncompressed CSV, because why not
;

Then this is a Python UDF that can read that CSV and parse it into an Object, while being aware of the headers:

create or replace function uncsv_py()
returns table(x variant)
language python
imports=('@my_int_stage/fx3.csv')
handler = 'X'
runtime_version = 3.8
as $$
import csv
import sys

IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

class X:
    def process(self):
        with open(import_dir + 'fx3.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield(row, )

$$;

And then you can read this UDF that outputs a table:

select *
from table(uncsv_py())
limit 10

enter image description here

A limitation of what I showed here is that the Python UDF needs an explicit name of a file (for now), as it doesn't take a whole folder. Java UDFs do - it will just take longer to write an equivalent UDF.

Upvotes: 1

Related Questions