user1098798

Reputation: 381

Could I improve HBase performance by reducing the HDFS block size?

I have approximately 2,500 tables involved in a calculation. In my dev environment these tables hold very little data, 10 to 10,000 rows each, with most tables at the lower end of that range. My calculation scans all these tables many times. Although the entire dataset would easily fit in memory, accessing it through HBase is incredibly slow, with a huge amount of disk activity.

Do you think it would help to reduce the HDFS block size? My reasoning is that if each table sits in its own block, a huge amount of memory would be wasted, preventing the entire dataset from residing in RAM. A greatly reduced block size would allow the system to hold most, if not all, of the data in RAM. Currently the block size is 64 MB.
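For reference, here is a minimal sketch of how I would check and lower the block size from the Hadoop 1.x Java client (the path and file are hypothetical; for a cluster-wide change the value belongs in hdfs-site.xml, which the DataNodes and clients read):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Block size for files this client creates from here on.
        // Hadoop 1.x key is "dfs.block.size"; Hadoop 2.x renamed it
        // to "dfs.blocksize". Cluster-wide, set it in hdfs-site.xml.
        conf.setLong("dfs.block.size", 8L * 1024 * 1024); // 8 MB instead of 64 MB

        FileSystem fs = FileSystem.get(conf);

        // Inspect the block size an existing file was written with
        // (the path here is hypothetical).
        FileStatus status = fs.getFileStatus(new Path("/hbase/mytable"));
        System.out.println("block size: " + status.getBlockSize());
        fs.close();
    }
}
```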

The final system will run on a larger cluster with far more memory and nodes; this is purely to speed up my dev environment.

Upvotes: 3

Views: 2403

Answers (2)

ozhang

Reputation: 171

If your block size is too small, you need more memory to hold the block indices. If the block size is too big, HBase has to scan more entries within each block to determine whether a searched key exists in it. If your KV pairs are around 100 bytes, then 640 KVs fit into the default 64 KB HBase block, which is a good value.
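Note this refers to the HBase (HFile) block size, not the HDFS block size. For illustration, a minimal sketch of setting it per column family, assuming the 0.94-era Java client API (table and family names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TuneBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // HFile blocks default to 64 KB: at ~100 bytes per KV, that is
        // roughly 640 KVs per block (640 * 100 B ~= 64 KB).
        HColumnDescriptor cf = new HColumnDescriptor("cf"); // hypothetical family
        cf.setBlocksize(64 * 1024); // bytes; smaller trades scan cost for index RAM

        admin.disableTable("mytable");     // hypothetical table
        admin.modifyColumn("mytable", cf); // replaces the family descriptor
        admin.enableTable("mytable");
        admin.close();
    }
}
```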

Upvotes: 0

Arnon Rotem-Gal-Oz

Reputation: 25939

HBase stores its data in HFiles (which are in turn stored inside Hadoop files). Here's an excerpt from the doc:

Minimum block size. We recommend a setting of minimum block size between 8KB to 1MB for general usage. Larger block size is preferred if files are primarily for sequential access. However, it would lead to inefficient random access (because there are more data to decompress). Smaller blocks are good for random access, but require more memory to hold the block index, and may be slower to create (because we must flush the compressor stream at the conclusion of each data block, which leads to an FS I/O flush). Further, due to the internal caching in Compression codec, the smallest possible block size would be around 20KB-30KB.

Regardless of the block size, you may want to set the tables' column families to IN_MEMORY => 'true', which makes HBase favor keeping their blocks in the block cache.
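A minimal sketch of doing that from the Java client, assuming the 0.94-era API (table and family names are hypothetical; the HBase shell's alter command can set the same flag):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MakeFamilyInMemory {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // IN_MEMORY gives this family's blocks priority in the block
        // cache; it is a hint, not a guarantee they stay in RAM.
        HColumnDescriptor cf = new HColumnDescriptor("cf"); // hypothetical family
        cf.setInMemory(true);

        admin.disableTable("mytable");     // hypothetical table
        admin.modifyColumn("mytable", cf);
        admin.enableTable("mytable");
        admin.close();
    }
}
```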

Lastly, your situation seems more appropriate for a cache like Redis or memcached than for HBase, but maybe I don't have enough context.

Upvotes: 4
