Reputation: 1
I am trying to run the Snowflake trial. The file is at s3://snowflake-workshop-lab/citibike-trips
Supposed to be CSV but is Parquet. Ideas?
Upvotes: 0
Views: 869
Reputation: 25928
There are three types of data in that bucket now, and the docs need improving.
The simplest way to fix this is to alter your stage to point at just the old trip CSV files. A more complex method is to alter the copy command to use a filter.
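As a sketch of the first approach: an external stage's URL can be changed with `ALTER STAGE ... SET URL`, and a trailing `trips` acts as a filename prefix filter (this assumes you created the stage with the name used later in this answer):

```
-- Re-point the existing stage at only the CSV trip files;
-- the trailing "trips" is a key prefix, so only trips_*.csv.gz match
alter stage citibike_trips
  set url = 's3://snowflake-workshop-lab/citibike-trips/trips';
```

After this, `list @citibike_trips;` should show only the `trips_*.csv.gz` files.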
original code:
create stage citibike_trips url='s3://snowflake-workshop-lab/citibike-trips';
list @citibike_trips;
file path | size
---|---
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/08/data_01a19496-0601-8b21-003d-9b03003c624a_3106_4_0.snappy.parquet | 1,227,891
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/09/data_01a19496-0601-8b21-003d-9b03003c624a_1906_6_0.snappy.parquet | 1,208,481
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/10/data_01a19496-0601-8b21-003d-9b03003c624a_2206_6_0.snappy.parquet | 4,866
s3://snowflake-workshop-lab/citibike-trips/json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_0.json.gz |
s3://snowflake-workshop-lab/citibike-trips/json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_1.json.gz |
s3://snowflake-workshop-lab/citibike-trips/trips_2013_0_0_0.csv.gz |
s3://snowflake-workshop-lab/citibike-trips/trips_2013_0_1_0.csv.gz |
This shows the three file types; the lab only cares about the trips_*.csv.gz files.
parquet files:
create stage citibike_trips_parquet url='s3://snowflake-workshop-lab/citibike-trips-parquet/';
list @citibike_trips_parquet;
JSON files:
create stage citibike_trips_json url='s3://snowflake-workshop-lab/citibike-trips/json/';
list @citibike_trips_json;
CSV files:
create stage citibike_trips_csv url='s3://snowflake-workshop-lab/citibike-trips/trips';
list @citibike_trips_csv;
The file format used in the lab appeared wrong to me. It seems equivalent to this SQL:
create file format CSV
type = CSV
COMPRESSION = AUTO
DATE_FORMAT = AUTO
TIMESTAMP_FORMAT = AUTO
ESCAPE_UNENCLOSED_FIELD = '\\'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
;
but that gives these two errors:
File 'citibike-trips/trips_2013_0_6_0.csv.gz', line 11, character 172
Row 11, column "TRIPS"["BIRTH_YEAR":15]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
and
Numeric value '' is not recognized File 'citibike-trips/trips_2013_0_5_0.csv.gz', line 4, character 118 Row 4, column "TRIPS"["END_STATION_ID":8]
The latter error points to needing to add a NULL_IF = ( '' ) to the format, thus changing the file format to:
create file format CSV_2
type = CSV
COMPRESSION = AUTO
DATE_FORMAT = AUTO
TIMESTAMP_FORMAT = AUTO
ESCAPE_UNENCLOSED_FIELD = '\\'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ( '' )
;
copy into trips from @citibike_trips_csv file_format=csv_2 ;
That worked.
The way I did it a while back was to have more or less the same file format as above and then control the load with:
create file format csv type = csv
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('\\N', '');
copy into trips from @citibike_trips
file_format = csv
PATTERN= '.*trips_.*csv.gz';
Upvotes: 2
Reputation: 524
Thanks for pointing this out. I guess you are following the lab https://s3.amazonaws.com/snowflake-workshop-lab/Snowflake_free_trial_LabGuide.pdf
I will get this checked internally, but yes, the files are Parquet type.
You can still continue with loading the files as Parquet by using a Parquet file format instead of CSV.
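A minimal sketch of that route, assuming the Parquet column names match the lab's TRIPS table (the stage name here is illustrative):

```
create file format parquet_ff type = parquet;

create stage citibike_trips_parquet
  url = 's3://snowflake-workshop-lab/citibike-trips-parquet/'
  file_format = parquet_ff;

-- Parquet rows load as a single VARIANT unless you transform them
-- or let COPY match Parquet field names to table columns:
copy into trips
  from @citibike_trips_parquet
  match_by_column_name = case_insensitive;
```

If the field names in the Parquet files differ from the TRIPS columns, you would instead need a `copy into trips from (select $1:field_name ... from @citibike_trips_parquet)` transformation.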
Upvotes: 0