Reputation: 806
Using: Amazon AWS Hive (0.13)
Trying to: output ORC files with Snappy compression.
create external table output(
col1 string)
partitioned by (col2 string)
stored as orc
location 's3://mybucket'
tblproperties("orc.compress"="SNAPPY");
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.output = true;
set mapred.output.compression.type = BLOCK;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
insert into table output
partition(col2)
select col1,col2 from input;
The problem is that when I look at the output in the mybucket directory, the files do not have a SNAPPY extension, although they are binary files. What setting am I missing to get these ORC files compressed and written out with a SNAPPY extension?
Upvotes: 1
Views: 10973
Reputation: 584
Additionally, you can use hive --orcfiledump /apps/hive/warehouse/orc/000000_0
to see the details of your file. The output will look like:
Reading ORC rows from /apps/hive/warehouse/orc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:int>
Stripe Statistics:
Stripe 1:
Column 0: count: 6
Column 1: count: 6 min: Beth max: Owen sum: 29
Column 2: count: 6 min: 1 max: 6 sum: 21
File Statistics:
Column 0: count: 6
Column 1: count: 6 min: Beth max: Owen sum: 29
Column 2: count: 6 min: 1 max: 6 sum: 21
....
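If your files are sitting in S3, as in the question, one option is to copy a single part file down and run the same dump against the local copy. This is a minimal sketch; the partition directory and file name are hypothetical, and it assumes the AWS CLI is configured for the bucket:
aws s3 cp s3://mybucket/col2=somevalue/000000_0 /tmp/000000_0
hive --orcfiledump /tmp/000000_0
For a Snappy-compressed table, the Compression line in the dump should read SNAPPY rather than ZLIB.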
Upvotes: 3
Reputation: 2181
ORC files are binary files in a specialized format. When you specify orc.compress = SNAPPY,
the contents of the file are compressed using Snappy. ORC is a semi-columnar file format.
Take a look at this documentation for more information about how data is laid out.
Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.
In short, your files are compressed with the Snappy codec; you just can't tell from the file names, because it is the blocks inside the file that are actually compressed.
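As a quick sanity check from inside Hive, you can also confirm that the compression property is actually attached to the table. A minimal sketch, using the table name from the question:
show tblproperties output;
The listing should include orc.compress with the value SNAPPY; combined with the orcfiledump output shown above, that tells you the data is Snappy-compressed even though the file names never change.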
Upvotes: 3