Reputation: 159
I have a bunch of parquet files and I've created external tables using Impala's CREATE EXTERNAL TABLE ....
However, when I issued
select * from my_table
it returned
ERROR: File 'hdfs://xyz..' has an incompatible Parquet schema for column 'xyz'. Column type: DECIMAL(5, 2), Parquet schema:optional int32 final_apr [i:4 d:1 r:0].
So I looked closely at the output of the command
parquet-tools schema my_parquet
and observed that all of the columns that are INT64, rather than fixed_len_byte_array, are the ones having this issue.
So I manually executed the command
ALTER TABLE schema.table_name CHANGE old-column new_column BIGINT;
to change the column type from whatever was created by default to BIGINT, per https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_parquet.html. After that I was able to execute select * from my_table;.
I've exercised the above in Scala using Spark, i.e. I was able to read the parquet file, store it as an Impala table, and issue the "select * from my_table" query programmatically.
However, when I try to query the table manually from the impala-shell, I encounter the same error shown above.
Instead of manually altering every column to change the data type of all the INT64, non-fixed-length byte array columns, is there a better way to handle this? Maybe something like: after reading the parquet file as a DataFrame, find all the INT64, non-fixed-length byte array columns, convert them to BIGINT, and then save the DataFrame as a table? Something like the sketch below.
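Roughly what I have in mind is this (just a sketch; the path, database, and table names are placeholders, and I'm assuming a SparkSession with Hive support):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Placeholder app/path/table names; assumes Hive support is available.
val spark = SparkSession.builder()
  .appName("parquet-int64-to-bigint")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.parquet("/path/to/my_parquet")

// Parquet INT64 columns arrive in Spark as LongType (i.e. BIGINT), so the cast is
// mostly a no-op; the point is that saveAsTable then declares those columns as
// BIGINT in the metastore instead of whatever was created by default.
val fixed = df.schema.fields.foldLeft(df) { (acc, field) =>
  if (field.dataType == LongType) acc.withColumn(field.name, col(field.name).cast("bigint"))
  else acc
}

// Save as a metastore table; Impala should see it after INVALIDATE METADATA / REFRESH.
fixed.write.mode("overwrite").saveAsTable("my_db.my_table")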
BTW, I'm not able to execute the select * from my_table;
query in the Hive shell even after changing the column's data type. I'm getting the error message
Error: Error while compiling statement: FAILED: ParseException line 1:30 extraneous input 'limit' expecting Number near '' (state=42000,code=40000).
I tried with a couple of tables, and all of them have the same issue in the Hive shell.
Appreciate your help!
Upvotes: 1
Views: 24428
Reputation: 1924
I had a similar issue copying parquet data from S3 to Redshift with a decimal column. I was getting
code: 15001
context: File 'https://s3.us-east-1.amazonaws.com/bucket/filename.snappy.parquet
has an incompatible Parquet schema for column
's3://bucket/filename/.column_name'. Column type: DECIMAL(9
The column in Redshift was a DECIMAL(9,5). The column in parquet was a double. I couldn't change the database because it's production. I ended up modifying the parquet output like so:
import org.apache.spark.sql.types.DataTypes
import spark.implicits._ // needed for the $"..." column syntax
val nineFive = DataTypes.createDecimalType(9, 5) // target type: DECIMAL(9,5)
val df2 = df.withColumn("column_name", $"column_name" cast nineFive) // cast the double column to decimal
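From there the adjusted DataFrame can be written back out (the output path below is only a placeholder to illustrate the step, not from my actual pipeline):

// Placeholder output location, just to show writing the casted data back out as parquet.
df2.write.mode("overwrite").parquet("s3a://bucket/fixed-output/")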
Upvotes: 0
Reputation: 419
The int32 in the Parquet schema warning means that your Impala column type should be INT, not BIGINT. When you alter the table column to INT, it should work (see the example below). By the way, Spark's and Impala's parquet read methods are different: Impala follows a more conservative way of reading parquet files. For example, Spark does not check the number of values in different column indices, while Impala does.
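For example, modeled on the ALTER TABLE statement from the question (the schema, table, and column names are placeholders):

ALTER TABLE schema.table_name CHANGE old_column new_column INT;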
Upvotes: 1