Reputation: 1
I am trying to run the Snowflake trial. The file is at s3://snowflake-workshop-lab/citibike-trips
Supposed to be CSV but is Parquet. Ideas?
Upvotes: 0
Views: 869
Reputation: 25928
There are three types of data in that bucket now, and the docs need improving.
The simplest way to fix this is to alter your stage to point at just the old trip CSV files. A more complex method is to alter the copy command to use a filter.
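As a sketch of the first approach: an external stage's URL can be changed with `ALTER STAGE ... SET URL`, and a trailing `trips` acts as a filename prefix filter (this assumes you created the stage with the name used later in this answer):

```
-- Re-point the existing stage at only the CSV trip files;
-- the trailing "trips" is a key prefix, so only trips_*.csv.gz match
alter stage citibike_trips
  set url = 's3://snowflake-workshop-lab/citibike-trips/trips';
```

After this, `list @citibike_trips;` should show only the `trips_*.csv.gz` files.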
original code:
create stage citibike_trips url='s3://snowflake-workshop-lab/citibike-trips';
list @citibike_trips;
file path | size
---|---
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/08/data_01a19496-0601-8b21-003d-9b03003c624a_3106_4_0.snappy.parquet | 1,227,891
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/09/data_01a19496-0601-8b21-003d-9b03003c624a_1906_6_0.snappy.parquet | 1,208,481
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/10/data_01a19496-0601-8b21-003d-9b03003c624a_2206_6_0.snappy.parquet | 4,866
s3://snowflake-workshop-lab/citibike-trips/json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_0.json.gz |
s3://snowflake-workshop-lab/citibike-trips/json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_1.json.gz |
s3://snowflake-workshop-lab/citibike-trips/trips_2013_0_0_0.csv.gz |
s3://snowflake-workshop-lab/citibike-trips/trips_2013_0_1_0.csv.gz |
This shows the three file types; the lab only cares about the trips_*.csv.gz files.
parquet files:
create stage citibike_trips_parquet url='s3://snowflake-workshop-lab/citibike-trips-parquet/';
list @citibike_trips_parquet;
JSON files:
create stage citibike_trips_json url='s3://snowflake-workshop-lab/citibike-trips/json/';
list @citibike_trips_json;
CSV files:
create stage citibike_trips_csv url='s3://snowflake-workshop-lab/citibike-trips/trips';
list @citibike_trips_csv;
The file format used in the lab appeared wrong to me. It seems equivalent to this SQL:
create file format CSV
type = CSV
COMPRESSION = AUTO
DATE_FORMAT = AUTO
TIMESTAMP_FORMAT = AUTO
ESCAPE_UNENCLOSED_FIELD = '\\'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
;
but that gives these two errors:
File 'citibike-trips/trips_2013_0_6_0.csv.gz', line 11, character 172
Row 11, column "TRIPS"["BIRTH_YEAR":15]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
and
Numeric value '' is not recognized File 'citibike-trips/trips_2013_0_5_0.csv.gz', line 4, character 118 Row 4, column "TRIPS"["END_STATION_ID":8]
The latter error points to needing to add a NULL_IF = ( '' ) to the format, thus changing the file format to:
create file format CSV_2
type = CSV
COMPRESSION = AUTO
DATE_FORMAT = AUTO
TIMESTAMP_FORMAT = AUTO
ESCAPE_UNENCLOSED_FIELD = '\\'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ( '' )
;
copy into trips from @citibike_trips_csv file_format=csv_2 ;
That worked.
The way I did it a while back was to have more or less the same file format as above and then control the load with:
create file format csv type = csv
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('\\N', '');
copy into trips from @citibike_trips
file_format = csv
PATTERN= '.*trips_.*csv.gz';
Upvotes: 2
Reputation: 524
Thanks for pointing this out. I guess you are following the lab https://s3.amazonaws.com/snowflake-workshop-lab/Snowflake_free_trial_LabGuide.pdf
I will get this checked internally, but yes, the files are Parquet type.
You can still continue with loading the files as Parquet by using a Parquet file format instead of CSV.
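A minimal sketch of that route, assuming the Parquet column names match the lab's TRIPS table (the stage name here is illustrative):

```
create file format parquet_ff type = parquet;

create stage citibike_trips_parquet
  url = 's3://snowflake-workshop-lab/citibike-trips-parquet/'
  file_format = parquet_ff;

-- Parquet rows load as a single VARIANT unless you transform them
-- or let COPY match Parquet field names to table columns:
copy into trips
  from @citibike_trips_parquet
  match_by_column_name = case_insensitive;
```

If the field names in the Parquet files differ from the TRIPS columns, you would instead need a `copy into trips from (select $1:field_name ... from @citibike_trips_parquet)` transformation.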
Upvotes: 0