Reputation: 806
Using: Amazon AWS Hive (0.13)
Trying to: output ORC files with Snappy compression.
create external table output(
col1 string)
partitioned by (col2 string)
stored as orc
location 's3://mybucket'
tblproperties("orc.compress"="SNAPPY");
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.output = true;
set mapred.output.compression.type = BLOCK;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
insert into table output
partition(col2)
select col1,col2 from input;
The problem is that when I look at the output in the mybucket directory, the files do not have a SNAPPY extension, although they are binary files. What setting am I missing to get these ORC files compressed and written out with a SNAPPY extension?
Upvotes: 1
Views: 10973
Reputation: 584
Additionally, you can use hive --orcfiledump /apps/hive/warehouse/orc/000000_0
to see the details of your file. The output will look like:
Reading ORC rows from /apps/hive/warehouse/orc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:int>
Stripe Statistics:
Stripe 1:
Column 0: count: 6
Column 1: count: 6 min: Beth max: Owen sum: 29
Column 2: count: 6 min: 1 max: 6 sum: 21
File Statistics:
Column 0: count: 6
Column 1: count: 6 min: Beth max: Owen sum: 29
Column 2: count: 6 min: 1 max: 6 sum: 21
....
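If your files are sitting in S3, as in the question, one option is to copy a single part file down and run the same dump against the local copy. This is a minimal sketch; the partition directory and file name are hypothetical, and it assumes the AWS CLI is configured for the bucket:
aws s3 cp s3://mybucket/col2=somevalue/000000_0 /tmp/000000_0
hive --orcfiledump /tmp/000000_0
For a Snappy-compressed table, the Compression line in the dump should read SNAPPY rather than ZLIB.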
Upvotes: 3
Reputation: 2181
ORC files are binary files in a specialized format. When you specify orc.compress = SNAPPY,
the contents of the file are compressed using Snappy. ORC is a semi-columnar file format.
Take a look at this documentation for more information about how data is laid out.
Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.
In short, your files are compressed with the Snappy codec; you just can't tell from the file names, because it is the blocks inside the file that are actually compressed.
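As a quick sanity check from inside Hive, you can also confirm that the compression property is actually attached to the table. A minimal sketch, using the table name from the question:
show tblproperties output;
The listing should include orc.compress with the value SNAPPY; combined with the orcfiledump output shown above, that tells you the data is Snappy-compressed even though the file names never change.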
Upvotes: 3