Reputation: 729
I have created a Parquet file and am now trying to import it into an Impala table.
I created the table as below:
CREATE EXTERNAL TABLE `user_daily` (
`user_id` BIGINT COMMENT 'User ID',
`master_id` BIGINT,
`walletAgency` BOOLEAN,
`zone_id` BIGINT COMMENT 'Zone ID',
`day` STRING COMMENT 'The stats are aggregated for single days',
`clicks` BIGINT COMMENT 'The number of clicks',
`impressions` BIGINT COMMENT 'The number of impressions',
`avg_position` BIGINT COMMENT 'The average position * 100',
`money` BIGINT COMMENT 'The cost of the clicks, in hellers',
`web_id` BIGINT COMMENT 'Web ID',
`discarded_clicks` BIGINT COMMENT 'Number of discarded clicks from column "clicks"',
`impression_money` BIGINT COMMENT 'The cost of the impressions, in hellers'
)
PARTITIONED BY (
year BIGINT,
month BIGINT
)
STORED AS PARQUET
LOCATION '/warehouse/impala/contextstat.db/user_daily/';
Then I copied files with this schema into that location:
parquet-tools schema user_daily/year\=2016/month\=8/part-r-00001-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet
message spark_schema {
optional int32 user_id;
optional int32 web_id (INT_16);
optional int32 zone_id;
required int32 master_id;
required boolean walletagency;
optional int64 impressions;
optional int64 clicks;
optional int64 money;
optional int64 avg_position;
optional double impression_money;
required binary day (UTF8);
}
And then when I try to see entries with
SELECT * FROM user_daily;
I get
File 'hdfs://.../warehouse/impala/contextstat.db/user_daily/year=2016/month=8/part-r-00000-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet'
has an incompatible Parquet schema for column 'contextstat.user_daily.user_id'.
Column type: BIGINT, Parquet schema:
optional int32 user_id [i:0 d:1 r:0]
Do you know how to solve this problem? I think that BIGINT is the same as int_32. Should I change the table schema or the way the Parquet files are generated?
Upvotes: 3
Views: 7914
Reputation: 729
I used CAST(... AS BIGINT), which changes the Parquet schema from int32 to int64. Then I had to reorder the columns, because they are not matched by name. After that it works.
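A rough sketch of what such a cast-and-reorder query could look like in Spark SQL, before the result is written back out as Parquet. The view name user_daily_raw is hypothetical, and the handling of discarded_clicks (absent from the file schema shown in the question) and impression_money (a double in the file, BIGINT in the table) are assumptions, not something confirmed in the thread:

```sql
-- Hypothetical temporary view `user_daily_raw` over the source data.
-- Columns are listed in the same order as in the Impala table definition,
-- since the columns are not matched by name.
SELECT
  CAST(user_id AS BIGINT)          AS user_id,
  CAST(master_id AS BIGINT)        AS master_id,
  walletagency                     AS walletAgency,
  CAST(zone_id AS BIGINT)          AS zone_id,
  day,
  clicks,
  impressions,
  avg_position,
  money,
  CAST(web_id AS BIGINT)           AS web_id,
  -- Assumption: this column is missing in the file, so it is filled with NULL.
  CAST(NULL AS BIGINT)             AS discarded_clicks,
  -- Assumption: truncating the double to an integer is acceptable here.
  CAST(impression_money AS BIGINT) AS impression_money
FROM user_daily_raw;
```

The partition columns year and month are left out because they come from the directory names (year=2016/month=8), not from the file contents.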
Upvotes: 0
Reputation: 3115
BIGINT is int64, which is why Impala complains. You don't necessarily have to work out the matching types yourself, though; Impala can do that for you. Just use the CREATE TABLE LIKE PARQUET variant:
The variation CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file' lets you skip the column definitions of the CREATE TABLE statement. The column names and data types are automatically configured based on the organization of the specified Parquet data file, which must already reside in HDFS.
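For the table in the question, that might look roughly like the following. This is a sketch, assuming the earlier table definition has been dropped and that the file from the parquet-tools output sits under the table location given in the question; adjust the path to wherever the file actually lives:

```sql
-- Column names and types are inferred from the given Parquet file.
-- Partition columns are not stored in the file, so they are still declared explicitly.
CREATE EXTERNAL TABLE user_daily
LIKE PARQUET '/warehouse/impala/contextstat.db/user_daily/year=2016/month=8/part-r-00001-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet'
PARTITIONED BY (
  year BIGINT,
  month BIGINT
)
STORED AS PARQUET
LOCATION '/warehouse/impala/contextstat.db/user_daily/';

-- Register the existing partition directory so its files become visible.
ALTER TABLE user_daily ADD PARTITION (year=2016, month=8);
```

The column comments from the original definition are not carried over this way, so they would have to be added separately if needed.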
Upvotes: 3