Reputation: 2747
In my work, I import Avro files into Impala tables by copying the files into HDFS and then executing "refresh" in Impala.
But when I tried to do the same with compressed files, it didn't work.
The only documentation I've found about enabling compression for Avro tables is this link: http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_avro.html#avro_compression_unique_1 .
Here is what I do:
Enable Hive output compression in the Hive shell:
hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=bzip2;
Create a table:
CREATE TABLE log_bzip2 (
  `timestamp` bigint COMMENT 'from deserializer',
  `appid` string COMMENT 'from deserializer',
  `clientid` string COMMENT 'from deserializer',
  `statkey` string COMMENT 'from deserializer',
  `expid` string COMMENT 'from deserializer',
  `modid` string COMMENT 'from deserializer',
  `value` double COMMENT 'from deserializer',
  `summary` string COMMENT 'from deserializer',
  `custom` string COMMENT 'from deserializer')
PARTITIONED BY (
  `day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.url'='hdfs://szq2.appadhoc.com:8020/user/hive/log.avsc');
Load the compressed Avro file into HDFS:
hdfs dfs -put log.2016-03-07.1457184357726.avro.bz2 /user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07
Add the partition and refresh in the Impala shell:
alter table log_bzip2 add partition (day="2016-03-07") location '/user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07/';
refresh log_bzip2;
Query it, but it does not work:
select * from log_bzip2 limit 10;
Query: select * from log_bzip2 limit 10
WARNINGS: Invalid AVRO_VERSION_HEADER: '42 5a 68 39 '
How can I do it right? Thanks!
Upvotes: 1
Views: 842
Reputation: 2747
It turns out the Avro format has its own way of compressing data, rather than compressing the generated Avro file manually. So what we need to do is set the compression codec on the Avro writer while writing the file; the generated file is then compressed internally by the Avro encoder. Putting this file into Hive works as before, and nothing else needs to be configured.
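For context, the bytes in the warning, 42 5a 68 39, are ASCII for "BZh9", the magic number of a raw bzip2 stream, so Impala was being handed a bzip2 file rather than an Avro container. Below is a minimal sketch (not the original code) of writing an internally compressed Avro file, assuming the Python avro package, the deflate codec (Impala reads Avro compressed with deflate or snappy), and placeholder field values mirroring the table columns; adapt it to the real log.avsc schema.

```python
# Minimal sketch: write an Avro container file whose data blocks are compressed
# by the Avro encoder itself, so the file keeps the normal Avro header and stays
# readable by Hive/Impala. Assumes the Python "avro" package; with avro-python3
# the call is avro.schema.Parse instead of avro.schema.parse.
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# The same schema the table points at via 'avro.schema.url'.
schema = avro.schema.parse(open("log.avsc", "rb").read())

# codec="deflate" (or "snappy") makes the writer compress each data block;
# Impala can read Avro files compressed with either codec.
writer = DataFileWriter(open("log.2016-03-07.avro", "wb"), DatumWriter(), schema,
                        codec="deflate")

# Placeholder record for illustration; field names mirror the table columns.
writer.append({
    "timestamp": 1457184357726,
    "appid": "demo-app",
    "clientid": "client-1",
    "statkey": "some-key",
    "expid": "exp-1",
    "modid": "mod-1",
    "value": 1.0,
    "summary": "",
    "custom": "",
})
writer.close()
```

The resulting .avro file is then put into HDFS and the partition added exactly as in the question; the hive.exec.compress.output and avro.output.codec settings only matter when Hive itself writes the files.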
Upvotes: 0