Reputation: 3426
I'm importing HFiles into HBase using the command:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles my_table
When I had a look at the HBase Master UI, I saw that all the data seems to be stored in one region:
The HFiles were created by a Spark application, using this command:
JavaPairRDD<String, MyEntry> myPairRDD = ...
myPairRDD.repartitionAndSortWithinPartitions(new HashPartitioner(hbaseRegions));
Why is the data not split across all regions?
Upvotes: 1
Views: 985
Reputation: 29155
Why is the data not split across all regions?
From the above picture it seems like your row keys are not salted before loading into HBase, so in the source table itself everything is going into one particular region.
Your RDD therefore carries over the source partitioning, which causes hotspotting.
Look at Rowkey design in the HBase docs.
So I would suggest pre-splitting the table into a number of regions at creation time (say 0 to 10) and then prepending a prefix in that range to each row key; this ensures a uniform distribution of the data.
For example:
create 'tableName', {NAME => 'colFam', VERSIONS => 2, COMPRESSION => 'SNAPPY'},
{SPLITS => ['0','1','2','3','4','5','6','7']}
The prefix can be any random id generated within the range of the pre-splits.
This kind of row key avoids hotspotting even as the data grows, and the data will be spread across the region servers.
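A minimal Java sketch of deriving such a prefix (here it is computed deterministically from the key's hash rather than randomly, so the same logical key always maps to the same bucket; the class and method names are illustrative, and the 8 buckets match the SPLITS '0'..'7' above):

public class RowKeySalter {

    // Number of salt buckets; matches the SPLITS '0'..'7' in the create statement above.
    private static final int NUM_SALT_BUCKETS = 8;

    // Prepend a stable one-character prefix derived from the original key,
    // so the same logical key always lands in the same pre-split region.
    public static String saltedKey(String originalKey) {
        int bucket = Math.floorMod(originalKey.hashCode(), NUM_SALT_BUCKETS);
        return bucket + "_" + originalKey;
    }
}

The Spark job would then use saltedKey(...) as the HBase row key when building the HFiles, so the rows are spread across all of the pre-split regions.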
Also have a look at my answer.
Upvotes: 4