How to import compressed AVRO files to Impala table?

Question

In my work, I import AVRO files into impala tables by copy the files into HDFS then execute "refresh " in impala.

But when I want to do it with compressed files, it didn't work.

The only document I've found about enable compress with avro tables is this link: http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_avro.html#avro_compression_unique_1 .

Here is what I do:

Enable Hive compress in hive shell:

hive> set hive.exec.compress.output=true;

hive> set avro.output.codec=bzip2;
Create a table:

CREATE TABLE log_bzip2( timestamp bigint COMMENT 'from deserializer', appid string COMMENT 'from deserializer', clientid string COMMENT 'from deserializer', statkey string COMMENT 'from deserializer', expid string COMMENT 'from deserializer', modid string COMMENT 'from deserializer', value double COMMENT 'from deserializer', summary string COMMENT 'from deserializer', custom string COMMENT 'from deserializer') PARTITIONED BY ( day string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ( 'avro.schema.url'='hdfs://szq2.appadhoc.com:8020/user/hive/log.avsc');
Load the compressed AVRO file into HDFS:

hdfs dfs -put log.2016-03-07.1457184357726.avro.bz2 /user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07
Add partition and refresh in Impala shell:

alter table log_bzip2 add partition (day="2016-03-07") location '/user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07/';

refresh log_bzip2;
Query it but not work:

select * from log_bzip2 limit 10; Query: select * from log_bzip2 limit 10

WARNINGS: Invalid AVRO_VERSION_HEADER: '42 5a 68 39 '

How can I do it right? Thanks!

Bin Wang · Accepted Answer

It turns out the avro format has its own way to compress data instead of compress the generated avro file manually. So what we need to do is add the compress option to the AVRO writer while writing the file, then the generated file is compressed by the avro encoder. Put this file into Hive is OK. Nothing else need to config.

How to import compressed AVRO files to Impala table?

Answers (1)

Related Questions