Andrew White

Reputation: 53516

Localizing HFile blocks in HDFS

We use MapReduce to bulk-create HFiles that are then incrementally/bulk loaded into HBase. Something I have noticed is that the load is simply an HDFS move command (which does not physically move the blocks of the files).
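For reference, the incremental load step looks roughly like this (a minimal sketch against the HBase 1.x client API; the table name and HFile directory are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin();
             Table table = connection.getTable(TableName.valueOf("my_table"));
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("my_table"))) {
            // doBulkLoad only renames/moves the HFiles into the region directories;
            // the underlying HDFS blocks stay on whichever datanodes the MR job wrote them to.
            new LoadIncrementalHFiles(conf)
                .doBulkLoad(new Path("/tmp/hfiles"), admin, table, locator);
        }
    }
}
```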

Since we do a lot of HBase table scans and we have short circuit reading enabled, it would be beneficial to have these HFiles localized to their respective region's node.

I know that a major compaction can accomplish this, but that is inefficient when the HFiles are small compared to the region size.
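For completeness, triggering such a compaction from a client is a single Admin call (a sketch assuming the HBase 1.x Admin API; the table name is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MajorCompactExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Request a major compaction (asynchronous). The RegionServers rewrite
            // the HFiles locally, which restores block locality for the region.
            admin.majorCompact(TableName.valueOf("my_table"));
        }
    }
}
```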

Upvotes: 2

Views: 788

Answers (1)

Anil Gupta

Reputation: 1126

HBase uses HDFS as its file system. HBase does not control the data locality of HDFS blocks.
When the HBase API is used to write data, the HBase RegionServer acts as an HDFS client, and if the client node is also a DataNode, HDFS places one replica of each block locally. Hence, the locality index is high when the HBase API is used for writes.

When bulk load is used, the HFiles are already present in HDFS, so HBase just makes those HFiles part of the regions. In this case data locality is not guaranteed.

If you really need high data locality, then rather than bulk loading I would recommend using the HBase API for writes.
I have been using the HBase API to write to HBase from my MR jobs and it has worked well so far.
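As a sketch of that approach (not the exact job from the answer; the table name, column family, and input record format are placeholders), a map-only MR job can push Puts through the RegionServers like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PutWriteJob {

    // Mapper that turns each input line into a Put. Writing through the RegionServer
    // write path (WAL + memstore) is what keeps the locality index high after flushes.
    public static class PutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            String[] fields = value.toString().split(",");   // "row,value" placeholder format
            Put put = new Put(Bytes.toBytes(fields[0]));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[1]));
            context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "put-write-job");
        job.setJarByClass(PutWriteJob.class);
        job.setMapperClass(PutMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Wires up TableOutputFormat so the Puts are sent to the table's RegionServers.
        TableMapReduceUtil.initTableReducerJob("my_table", null, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```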

Upvotes: 1
