Chul Kwon

Reputation: 167

HBase Key-Value Compression?

Thanks for taking interest in my question. Before I begin, I'd like to let you know that I'm very new to Hadoop & HBase. So far, I find Hadoop very interesting and would like to contribute more in the future.

I'm primarily interested in improving the performance of HBase. To do so, I modified the Writer methods in HBase's /io/hfile/HFile.java so that they do high-speed buffered data assembly and then write directly to Hadoop, where the data can later be loaded by HBase.

Now, I'm trying to come up with a way to compress key-value pairs so that bandwidth can be saved. I've done a lot of research to figure out how, and then realized that HBase has built-in compression libraries.

I'm currently looking at SequenceFile (1), setCompressMapOutput (2) (deprecated), and the Compression class (3). I also found a tutorial on Apache's MapReduce.

Could someone explain what "SequenceFile" is, and how I can apply those compression libraries and algorithms? These different classes and documents are so confusing to me.

I'd sincerely appreciate your help.

--

Hyperlinks:

(1): hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html

(2): hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setCompressMapOutput%28boolean%29

(3): www.apache.org/dist/hbase/docs/apidocs/org/apache/hadoop/hbase/io/hfile/Compression.html

Upvotes: 0

Views: 2298

Answers (2)

mikhail_b

Reputation: 970

SequenceFile is a key/value pair file format implemented in Hadoop. Even though HBase uses SequenceFile to store its write-ahead logs, SequenceFile's block compression implementation is not used there.
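
For illustration, here is a minimal sketch of writing a block-compressed SequenceFile with the old Hadoop 1.x API (untested; the output path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/demo.seq"); // placeholder path

            // BLOCK compression buffers many key/value pairs and compresses
            // them together. That gives a good ratio, but records are not
            // durable until the block is flushed -- which is exactly why
            // HBase cannot use it for the write-ahead log.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, path,
                    Text.class, BytesWritable.class,
                    SequenceFile.CompressionType.BLOCK,
                    new GzipCodec());
            try {
                writer.append(new Text("row1"), new BytesWritable("value1".getBytes()));
                writer.append(new Text("row2"), new BytesWritable("value2".getBytes()));
            } finally {
                writer.close();
            }
        }
    }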

The Compression class is part of Hadoop's compression framework and as such is used in HBase's HFile block compression.

HBase already has built-in compression of the following types:

  • HFile block compression on disk. This uses Hadoop's codec framework and supports compression algorithms such as LZO, GZIP, and SNAPPY. This type of compression is only applied to HFile blocks that are stored on disk, because the whole block needs to be uncompressed to retrieve key/value pairs. (There is a configuration sketch after this list.)
  • In-cache key compression (called "data block encoding" in HBase terminology); see HBASE-4218. Implemented encoding algorithms include various types of prefix and delta encoding, and trie encoding is being implemented as of this writing (HBASE-4676). Data block encoding algorithms take advantage of the redundancy between sorted keys in an HFile block and store only the differences between consecutive keys. These algorithms currently do not deal with values, and are therefore most useful when values are small relative to keys, e.g. counters. Because these data block encoding algorithms are lightweight, it is possible to efficiently decode only the necessary part of a block to retrieve the requested key or advance to the next key, which is why they are good for improving cache efficiency. However, on some real-world datasets delta encoding also saves up to 50% on top of LZO compression (i.e. applying delta encoding and then LZO vs. LZO only), thus achieving significant savings on disk as well.
  • A custom dictionary-based write-ahead log compression approach is implemented in HBASE-4608. Note: even though SequenceFile is used for write-ahead log storage in HBase, SequenceFile's built-in block compression cannot be used for the write-ahead log, because buffering key/value pairs for block compression would cause data loss.
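
As a rough sketch of how the first two types are configured per column family (assuming HBase 0.94+, where HColumnDescriptor.setDataBlockEncoding is available; the table and family names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CompressedTableDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor table = new HTableDescriptor("demo_table"); // made-up name
            HColumnDescriptor family = new HColumnDescriptor("cf");

            // 1. HFile block compression on disk. GZ works out of the box;
            //    LZO and SNAPPY need native libraries installed.
            family.setCompressionType(Compression.Algorithm.GZ);

            // 2. Data block encoding (HBASE-4218): delta-encodes the sorted
            //    keys inside each block, on disk and in the block cache.
            family.setDataBlockEncoding(DataBlockEncoding.PREFIX);

            table.addFamily(family);
            admin.createTable(table);
            admin.close();
        }
    }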

HBase RPC compression is a work in progress. As you mentioned, compressing key/value pairs passed between the client and HBase can save bandwidth and improve HBase performance. This has been implemented in Facebook's version of HBase, 0.89-fb (HBASE-5355), but has yet to be ported to the official Apache HBase trunk. The RPC compression algorithms supported in HBase 0.89-fb are the same as those supported by the Hadoop compression framework (e.g. GZIP and LZO).

The setCompressMapOutput method is a MapReduce job configuration method and is not really relevant to HBase compression.
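
For completeness, this is roughly what it configures in a plain MapReduce job (old, deprecated mapred API; illustration only):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompressionDemo {
        public static void main(String[] args) {
            JobConf job = new JobConf();
            // Compresses only the intermediate map output that is shuffled
            // to the reducers -- a MapReduce setting, not an HBase one.
            job.setCompressMapOutput(true);
            job.setMapOutputCompressorClass(GzipCodec.class);
        }
    }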

Upvotes: 4

Spike Gronim

Reputation: 6182

A SequenceFile is a stream of key/value pairs used by Hadoop. You can read more about it on the Hadoop wiki.
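
For illustration, reading one back looks roughly like this (old Hadoop 1.x reader API, untested; the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileReadDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/demo.seq"); // placeholder path

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
                Text key = new Text();
                BytesWritable value = new BytesWritable();
                // next() fills in key/value and returns false at end of file.
                while (reader.next(key, value)) {
                    System.out.println(key + " => " + value.getLength() + " bytes");
                }
            } finally {
                reader.close();
            }
        }
    }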

Upvotes: 0
