Reputation: 11
I need to use Parquet files to do some analysis and enrich them with information. However, I cannot do anything because of incompatible column datatypes (unsigned integers).
I tried to use Apache Drill, but the same error occurs when I execute queries.
Here is the query I want to start with, and the error message it produces:
SELECT * from dfs.`/home/artyom/addresses.parquet` LIMIT 5;
Error: INTERNAL_ERROR ERROR: Error in parquet record reader.
Message:
Hadoop path: /home/artyom/addresses.parquet/part.0.parquet
Total records read: 0
Row group index: 0
Records in row group: 34369585
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message schema
{
optional binary ip (UTF8);
optional int64 ip_id (UINT_64);
optional int32 reputation (UINT_8);
optional int32 confidence (UINT_8);
optional float queries_ratio;
(...)
}
But queries like:
SELECT ip from dfs.`/home/artyom/addresses.parquet` LIMIT 5;
or
SELECT queries_ratio from dfs.`/home/artyom/addresses.parquet` LIMIT 5;
work like a charm.
Only the unsigned integer columns are a problem.
I read the Apache Drill documentation about converting datatypes and tried several things, but without success.
Could someone help me with this and tell me if there is a way to convert the UINT_X columns into compatible INTEGER types? The conversion from unsigned integer to integer will not be a problem for the data. I just need to find out how to modify the column datatypes of the Parquet file. Thanks a lot!
Upvotes: 1
Views: 993
Reputation: 671
As a workaround, you can switch to the other Parquet reader by setting the option store.parquet.use_new_reader = true.
The issue in the default reader will be fixed in Drill 1.17.0 (see https://issues.apache.org/jira/browse/DRILL-5983 for details).
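A minimal sketch of this workaround, assuming the option is set at session level with Drill's ALTER SESSION statement and that the path is quoted with backticks; the column names are taken from the schema in the question:

```sql
-- Switch the current session to the alternative Parquet reader.
ALTER SESSION SET `store.parquet.use_new_reader` = true;

-- Retry the unsigned-integer columns that failed with the default reader.
SELECT ip_id, reputation, confidence
FROM dfs.`/home/artyom/addresses.parquet`
LIMIT 5;
```

Setting the option back to false restores the default reader for the session.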
Upvotes: 1
Reputation: 661
Could you please clarify which version of Drill you are using? It looks like the issue with reading UINT types was fixed in the scope of DRILL-4764 and DRILL-5971.
So it should work on Drill 1.14 and later.
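If you are not sure which version you are running, Drill exposes its build information through the sys.version system table. A short sketch, to be run from the same Drill shell as the failing query:

```sql
-- Report the version of the Drill instance you are connected to.
SELECT version FROM sys.version;
```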
Upvotes: 1