Tom

Reputation: 6342

Lz4 compression is not splittable

I am using lz4 compression to write data to a Hive table. The table has 20 files on HDFS, each 15G, and every file name ends with .lz4, e.g. part-m-00000.lz4.

When I run select count(1) on this table, it launches only 20 mappers, which means lz4 splitting is not taking effect.
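For scale, here is a rough estimate of how many mappers a splittable input of this size would produce. It is only a sketch: the 128 MB split size is an assumption (a common HDFS block size, not something stated in the question), "15G" is read as 15 GiB, and `expected_splits` is a hypothetical helper, not a Hadoop API.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2


def expected_splits(num_files, file_size_bytes, split_size_bytes, splittable):
    """Estimate the number of input splits (and so mappers) for a set of equal-sized files."""
    if not splittable:
        # A non-splittable file is read start to finish by a single mapper.
        return num_files
    # A splittable file yields roughly one split per split-sized chunk.
    return num_files * math.ceil(file_size_bytes / split_size_bytes)


# Assumed values: 20 files of 15 GiB each, 128 MiB splits.
print(expected_splits(20, 15 * GIB, 128 * MIB, splittable=False))  # 20
print(expected_splits(20, 15 * GIB, 128 * MIB, splittable=True))   # 2400
```

Seeing exactly one mapper per file is therefore consistent with the files being treated as non-splittable.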

lz4 is said to be splittable for text files, so what should I do, or what additional steps are needed, to enable this?

Upvotes: 2

Views: 1054

Answers (1)

Cyan

Reputation: 13948

Assuming you have some control over how the data is compressed, this codec might be closer to what you need, since it embeds a splittable layer. It is designed for use with Hadoop.

If you can't change the format, and the data was compressed as a single stream with no jump table, then I'm afraid there is no good solution. The lz4 CLI will, by default, split data into blocks of 4 MB, but it does not provide any jump table. The jump table is what makes an archive easy to read in random order. Without it, the data has to be streamed sequentially, and the blocks distributed in order for later processing.
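To illustrate what a jump table buys you, here is a minimal sketch of the idea, not the actual lz4 frame format or any Hadoop codec. It uses Python's standard `zlib` as a stand-in compressor, since lz4 is not in the standard library, and the function names `compress_with_jump_table` and `read_block` are hypothetical.

```python
import zlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, the lz4 CLI default block size mentioned above


def compress_with_jump_table(data, block_size=BLOCK_SIZE):
    """Compress data as independent blocks and record where each block starts."""
    payload = bytearray()
    jump_table = []  # one (offset, compressed_length) entry per block
    for start in range(0, len(data), block_size):
        # zlib stands in for lz4 here; each block is compressed on its own.
        block = zlib.compress(data[start:start + block_size])
        jump_table.append((len(payload), len(block)))
        payload += block
    return bytes(payload), jump_table


def read_block(payload, jump_table, index):
    """Decompress a single block without reading the blocks before it."""
    offset, length = jump_table[index]
    return zlib.decompress(payload[offset:offset + length])


# Small demonstration with 4 KB blocks instead of 4 MB.
data = bytes(range(256)) * 64
payload, table = compress_with_jump_table(data, block_size=4096)
third_block = read_block(payload, table, 2)
```

With the table, a worker can seek straight to any block, which is what lets a framework hand different blocks to different mappers. Without it, the only way to find where block N starts is to read through blocks 0 to N-1 first.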

Upvotes: 1
