wrschneider

Reputation: 18780

COPY from Parquet S3 into Redshift and decimal vs. int types

I am running into this error when trying to COPY data from Parquet in S3 to Redshift:

S3 Query Exception (Fetch). Task failed due to an internal error.
File 'https://...../part-00000-xxxxx.snappy.parquet' has an incompatible
Parquet schema for column 's3://table_name/.column_name'.
Column type: INT, Parquet schema: optional fixed_len_byte_array COLUMN_NAME

I suspect this is because the Parquet file has a numeric/decimal type with greater precision than fits into an INT column; however, I believe all the actual values are within a range where they would fit. (The error does not specify a row number.)

Is there a way to coerce the type conversion on COPY, and to take failures on an individual-row basis (as with CSV) rather than failing the whole file?

Upvotes: 1

Views: 7925

Answers (1)

Daniel R Carletti

Reputation: 549

I spent a day on a similar issue and found no way to coerce types in the COPY command. I was building my Parquet files with Pandas and had to match the data types to the ones in Redshift. For integers, I used Pandas int64 with Redshift BIGINT. Similarly, I had to change NUMERIC columns to DOUBLE PRECISION (Pandas float64).
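
If it helps, here is a minimal sketch of that dtype matching, assuming the Parquet file is written with Pandas before the COPY; the column names, values, and output file name are placeholders, not from the original post:

    import pandas as pd

    # Hypothetical DataFrame standing in for the data being exported; the
    # column names are illustrative only.
    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": ["19.99", "5.50", "100.00"],  # decimal-like values read in as strings
    })

    # Cast so the Parquet physical types match the Redshift DDL:
    #   int64   -> Parquet INT64  -> Redshift BIGINT
    #   float64 -> Parquet DOUBLE -> Redshift DOUBLE PRECISION
    # This avoids the fixed_len_byte_array (decimal) encoding that the INT
    # column in the error message rejects.
    df = df.astype({"order_id": "int64", "amount": "float64"})

    # Write the Snappy-compressed Parquet file that will later be COPYed from S3
    # (pyarrow is the default engine).
    df.to_parquet("part-00000.snappy.parquet", compression="snappy", index=False)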

The file fails as a whole because the COPY command for columnar formats (like Parquet) loads an entire column before moving on to the next, so there is no way to fail individual rows. See the AWS documentation on COPY from columnar data formats.
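
For completeness, a sketch of the COPY side run from Python with psycopg2; the cluster endpoint, credentials, table, S3 prefix, and IAM role ARN are all placeholders, not values from the question:

    import psycopg2

    # Placeholder connection details; in practice these come from your own
    # cluster configuration or a secrets store.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="awsuser",
        password="...",
    )

    copy_sql = """
        COPY my_schema.my_table
        FROM 's3://my-bucket/path/to/parquet/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS PARQUET;
    """

    # Parquet COPY loads column by column, so a schema mismatch aborts the whole
    # file; unlike CSV there is no MAXERROR-style per-row tolerance to fall back on.
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)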

Upvotes: 5
