Ponns

Reputation: 159

Parquet Datatype Issue

I have a bunch of parquet files and I've created external tables using Impala's CREATE EXTERNAL TABLE....

However when I issued

select * from my_table

it returned

ERROR: File 'hdfs://xyz..' has an incompatible Parquet schema for column 'xyz'. Column type: DECIMAL(5, 2), Parquet schema:optional int32 final_apr [i:4 d:1 r:0].

So I closely looked at the output of command

parquet-tools schema my_parquet

and observed that all the columns that are INT64 (rather than fixed_len_byte_array) have this issue.

So I manually executed the command

ALTER TABLE schema.table_name CHANGE old-column new_column BIGINT; 

to change the column type from whatever it was created with by default to BIGINT, per https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_parquet.html, and after that I was able to execute select * from my_table;

I've also exercised the above in Scala using Spark, i.e. I was able to read the parquet file, store it as an Impala table, and issue the "select * from my_table" query programmatically.

However, when I try to query manually from the impala shell, I encounter the same error shown above.

Instead of manually updating the columns to change the datatype of every INT64 non-fixed-length byte array column, is there a better way to handle this? Maybe something like: after reading the parquet file as a DataFrame, find all INT64 non-fixed-length byte array columns, convert them to BIGINT, and then save the DataFrame as a table? (A sketch of that idea is shown below.)
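
Not tested against this exact data, but a minimal sketch of that idea in Spark/Scala, assuming a Hive-enabled SparkSession; the input path and table name are placeholders. Note that Spark already reads Parquet INT64 as LongType (which saveAsTable records as BIGINT), so the explicit cast mainly makes the intent visible:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.LongType

// Hypothetical session, path and table name -- adjust to your environment.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = spark.read.parquet("hdfs:///path/to/my_parquet")

// Columns Spark read as LongType correspond to Parquet INT64;
// cast them explicitly so the intent (BIGINT in the metastore) is clear.
val longCols = df.schema.fields.filter(_.dataType == LongType).map(_.name)
val fixed = longCols.foldLeft(df)((acc, c) => acc.withColumn(c, acc(c).cast(LongType)))

// saveAsTable records the DataFrame's types in the metastore,
// so the table definition matches what is actually in the Parquet files.
fixed.write.mode("overwrite").format("parquet").saveAsTable("schema.my_table")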

BTW, I'm not able to execute the select * from my_table; query in the hive shell even after changing the column's data type. I'm getting this error message:

Error: Error while compiling statement: FAILED: ParseException line 1:30 extraneous input 'limit' expecting Number near '' (state=42000,code=40000).

I tried with a couple of tables; all of them have the same issue in the hive shell.

Appreciate your help!

Upvotes: 1

Views: 24428

Answers (2)

Jeremy

Reputation: 1924

I had a similar issue copying parquet data from S3 to Redshift with a decimal column. I was getting

  code:      15001
  context:   File 'https://s3.us-east-1.amazonaws.com/bucket/filename.snappy.parquet  
             has an incompatible Parquet schema for column 
             's3://bucket/filename/.column_name'. Column type: DECIMAL(9

The column in Redshift was a DECIMAL(9,5). The column in parquet was a double. I couldn't change the database because it was production. I ended up modifying the parquet output like so:

import org.apache.spark.sql.types.DataTypes  // for createDecimalType
import spark.implicits._                     // for the $"..." column syntax
// Recast the double column to DECIMAL(9,5) so it matches the Redshift column.
val nineFive = DataTypes.createDecimalType(9, 5)
val df2 = df.withColumn("column_name", $"column_name" cast nineFive)
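
For completeness, a hedged follow-up on how that recast frame might then be written back out (the output path is hypothetical) so the Redshift COPY sees a matching DECIMAL type:

// Write the recast DataFrame back out to a hypothetical S3 location;
// the COPY into the DECIMAL(9,5) column then reads a matching type.
df2.write.mode("overwrite").parquet("s3://bucket/fixed-output/")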

Upvotes: 0

HakkiBuyukcengiz

Reputation: 419

The int32 in the Parquet schema warning means that your Impala column type should be INT, not BIGINT. When you alter the table column to INT, it should work. BTW, Spark and Impala read Parquet files differently; Impala follows a more conservative approach. For example, Spark does not check the number of values across different column indices, while Impala does.

Upvotes: 1
