Reputation: 167
Thanks for taking interest in my question. Before I begin, I'd like to let you know that I'm very new to Hadoop & HBase. So far, I find Hadoop very interesting and would like to contribute more in the future.
I'm primarily interested in improving the performance of HBase. To that end, I modified the Writer methods in HBase's /io/hfile/HFile.java so that they do high-speed buffered data assembly and then write directly to Hadoop, so the data can later be loaded by HBase.
Now I'm trying to come up with a way to compress key/value pairs so that bandwidth can be saved. After a lot of research into how to do this, I realized that HBase has built-in compression libraries.
I'm currently looking at SequenceFile (1), setCompressMapOutput (2) (deprecated), and the Compression class (3). I also found a tutorial on Apache's MapReduce.
Could someone explain what a "SequenceFile" is, and how I can use these compression libraries and algorithms? The different classes and documents are confusing to me.
I'd sincerely appreciate your help.
--
Hyperlinks:
(1): hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
(2): hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setCompressMapOutput%28boolean%29
(3): www.apache.org/dist/hbase/docs/apidocs/org/apache/hadoop/hbase/io/hfile/Compression.html
Upvotes: 0
Views: 2298
Reputation: 970
SequenceFile is a key/value pair file format implemented in Hadoop. Although SequenceFile is used in HBase for storing write-ahead logs, SequenceFile's block compression implementation is not.

The Compression class is part of Hadoop's compression framework and as such is used in HBase's HFile block compression.
HBase already has built-in compression at several levels:

- SequenceFile's built-in block compression cannot be used for the write-ahead log, because buffering key/value pairs for block compression would cause data loss.
- HBase RPC compression is a work in progress. As you mentioned, compressing key/value pairs passed between the client and HBase can save bandwidth and improve HBase performance. This has been implemented in Facebook's version of HBase, 0.89-fb (HBASE-5355), but it has yet to be ported to the official Apache HBase trunk. The RPC compression algorithms supported in HBase 0.89-fb are the same as those supported by the Hadoop compression framework (e.g. GZIP and LZO).
The setCompressMapOutput method is a MapReduce configuration method and is not really relevant to HBase compression.
Upvotes: 4
Reputation: 6182
A SequenceFile is a stream of key/value pairs used by Hadoop. You can read more about it on the Hadoop wiki.
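To make "a stream of key/value pairs" concrete, here is a toy sketch in plain Java. This is not Hadoop's actual SequenceFile on-disk format (the real one adds a header, sync markers so the file can be split, and optional record/block compression); it only illustrates the idea of serialized key/value records written and read back sequentially.

```java
import java.io.*;

// Toy illustration of a SequenceFile-style stream of key/value records.
// NOT Hadoop's real format: no header, no sync markers, no compression.
public class KeyValueStreamSketch {
    // Append each key and value to the stream in order.
    static byte[] write(String[][] pairs) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (String[] kv : pairs) {
                out.writeUTF(kv[0]);
                out.writeUTF(kv[1]);
            }
        }
        return buf.toByteArray();
    }

    // Read the records back in the same order they were written.
    static String[][] read(byte[] data, int count) throws IOException {
        String[][] pairs = new String[count][];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < count; i++) {
                pairs[i] = new String[] { in.readUTF(), in.readUTF() };
            }
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        String[][] pairs = { { "row1", "hello" }, { "row2", "world" } };
        String[][] back = read(write(pairs), pairs.length);
        System.out.println(back[0][0] + "=" + back[0][1]); // row1=hello
    }
}
```

In real Hadoop code you would use org.apache.hadoop.io.SequenceFile.Writer and SequenceFile.Reader instead, which handle the header, sync markers, and compression for you.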
Upvotes: 0