HBase: how to choose pre-split strategies and how they affect your rowkeys

Question

I am trying to pre-split an HBase table. One of the HBaseAdmin Java API methods creates a table from a start key, an end key and a number of regions. Here is the method I use from HBaseAdmin: void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)

Is there any recommendation on choosing the start key and end key based on the dataset?

My approach: say we have 100 records in the dataset and I want the data divided into roughly 10 regions of about 10 records each. To find the start key I would run scan '/mytable', {LIMIT => 10} and pick the last rowkey, then run scan '/mytable', {LIMIT => 90} and pick the last rowkey as my end key.

Does this approach to finding the start key and end key look OK, or is there a better practice?

EDIT: I tried the following approaches to pre-split an empty table. None of the three worked the way I used them. I think I will need to salt the key to get an even distribution.

PS: I am displaying only some of the region info.

1)

byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);

This gives regions with boundaries like:

{
    "startkey":"-INFINITY",
    "endkey":"11111111",
    "numberofrows":3628951,
},
{
    "startkey":"11111111",
    "endkey":"22222222",
},
{   
    "startkey":"22222222",
    "endkey":"33333333",
},
{
    "startkey":"33333333",
    "endkey":"44444444",
},
{
    "startkey":"88888888",
    "endkey":"99999999",
},
{
    "startkey":"99999999",
    "endkey":"aaaaaaaa",
},
{
    "startkey":"aaaaaaaa",
    "endkey":"bbbbbbbb",
},
{
    "startkey":"eeeeeeee",
    "endkey":"INFINITY",
}

This is useless, as my rowkeys have a composite form like 'deptId|month|roleId|regionId' and don't fit into the above boundaries.

2)

byte[][] splits = new RegionSplitter.UniformSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);

This has the same issue:

{
    "startkey":"-INFINITY",
    "endkey":"\x19\x99\x99\x99\x99\x99\x99\x99",
}
{
    "startkey":"\x19\x99\x99\x99\x99\x99\x99\
    "endkey":"33333332",
}
{
    "startkey":"33333332",
    "endkey":"L\xCC\xCC\xCC\xCC\xCC\xCC\xCB",
}
{
    "startkey":"\xE6ffffffa",
    "endkey":"INFINITY",
}

3) I tried supplying a start key and an end key and got the following useless regions.

hBaseAdmin.createTable(tabledescriptor,
        Bytes.toBytes("04120|200808|805|1999"),
        Bytes.toBytes("01253|201501|805|1999"), 10);

This gives regions like:
{
    "startkey":"-INFINITY",
    "endkey":"04120|200808|805|1999",
}
{
    "startkey":"04120|200808|805|1999",
    "endkey":"000PTP\xDC200W\xD07\x9C805|1999",
}
{
    "startkey":"000PTP\xDC200W\xD07\x9C805|1999",
    "endkey":"000ptq<200wp6\xBC805|1999",
}
{
    "startkey":"001\x11\x15\x13\x1C201\x15\x902\x5C805|1999",
    "endkey":"01253|201501|805|1999",
}
{
    "startkey":"01253|201501|805|1999",
    "endkey":"INFINITY",
}

Ram Ghadiyaram · Accepted Answer

First question: In my experience with HBase, I am not aware of any hard rule for choosing the number of regions or the start and end keys.

But the underlying principle is:

With your rowkey design, data should be distributed across the regions and not hotspotted (see 36.1. Hotspotting in the HBase reference guide).

However, if you define a fixed number of regions, such as the 10 you mentioned, there may not be exactly 10 after a heavy data load. Once a region reaches a certain size limit, it will split again.

For the way you are creating the table, the HBaseAdmin documentation says: "Creates a new table with the specified number of regions. The start key specified will become the end key of the first region of the table, and the end key specified will become the start key of the last region of the table (the first region has a null start key and the last region has a null end key)."
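To illustrate why the intermediate boundaries in attempt 3 contain non-printable bytes: the keys between the start key and the end key are computed arithmetically, treating each key as one large number, so delimiter and digit bytes change freely. Below is a minimal pure-JDK sketch of that idea. It is not HBase's implementation (as far as I know HBase uses Bytes.split for this), and the class KeyInterpolation and its methods are hypothetical names used only for illustration. Note also that interpolation only makes sense when the start key sorts before the end key, which is not the case for the two keys in attempt 3 ("04120..." sorts after "01253...").

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, NOT part of HBase: approximates byte-wise key interpolation.
class KeyInterpolation {

    /** Returns numRegions - 1 boundaries: startKey, evenly spaced keys, endKey. */
    static byte[][] boundaries(byte[] startKey, byte[] endKey, int numRegions) {
        if (numRegions < 3) {
            throw new IllegalArgumentException("need at least 3 regions");
        }
        int len = Math.max(startKey.length, endKey.length);
        // Right-pad with zero bytes so both keys have the same width,
        // then read them as unsigned big-endian numbers.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(startKey, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(endKey, len));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start key must sort before end key");
        }
        int steps = numRegions - 2; // regions lying between startKey and endKey
        BigInteger range = hi.subtract(lo);
        byte[][] result = new byte[numRegions - 1][];
        for (int i = 0; i <= steps; i++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(i))
                                     .divide(BigInteger.valueOf(steps));
            result[i] = toFixedWidth(lo.add(offset), len);
        }
        return result;
    }

    /** Converts a non-negative number back to exactly len bytes, big-endian. */
    static byte[] toFixedWidth(BigInteger value, int len) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }
}
```

For example, the boundaries between "01|2015" and "04|2015" with 4 regions include a middle key whose third byte is 0xFC instead of '|', so it no longer looks like a composite key. That is the same effect as the "000PTP\xDC200W..." boundaries in the question.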

Moreover, I prefer creating a table through a script with pre-splits, say 0-10, and designing the rowkey so that it is salted: each key then falls into one of the pre-split regions, which spreads writes across the region servers and avoids hotspotting.
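A salted rowkey along those lines could be sketched as follows. This is only an illustration under assumptions: the class SaltedKeys, the constant NUM_BUCKETS and the "bucket|originalKey" layout are hypothetical choices, not something prescribed by HBase. The salt is derived from the key's hash so that a reader can recompute it for point lookups.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: salts rowkeys into a fixed number of buckets.
class SaltedKeys {
    // Assumption: one bucket per pre-split region, single-digit prefixes 0..9.
    static final int NUM_BUCKETS = 10;

    /** Prefixes the key with a bucket digit derived from its hash: "<bucket>|<rowKey>". */
    static String salt(String rowKey) {
        // String.hashCode is specified by the JDK, so the bucket is stable across JVMs.
        int bucket = Math.floorMod(rowKey.hashCode(), NUM_BUCKETS);
        return bucket + "|" + rowKey;
    }

    /** Split points "1".."9": the region starting at digit i holds all keys salted with i. */
    static byte[][] splitPoints() {
        byte[][] splits = new byte[NUM_BUCKETS - 1][];
        for (int i = 1; i < NUM_BUCKETS; i++) {
            splits[i - 1] = Integer.toString(i).getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }
}
```

The array returned by splitPoints() has the same shape as the splits passed to hBaseAdmin.createTable(tabledescriptor, splits) in the question. The trade-off is that a range scan over the original keys has to be issued once per bucket, because neighbouring keys no longer sit next to each other.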

EDIT: If you want your own region split logic, you can provide your own implementation of org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm and override

public byte[][] split(int numberOfSplits)
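One possible basis for such a split method is a sample of real rowkeys, which is close to what the question proposes. The sketch below picks evenly spaced keys from a sorted sample. It is only the core logic, not a SplitAlgorithm implementation; the class SampleSplits and the method fromSample are hypothetical names, and it assumes ASCII rowkeys so that Java string order matches HBase's byte-wise order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: derives split points from a sample of existing rowkeys.
class SampleSplits {

    /** Returns numRegions - 1 split keys taken at evenly spaced positions of the sorted sample. */
    static byte[][] fromSample(List<String> sampleKeys, int numRegions) {
        // Sort and de-duplicate the sample (assumes ASCII keys, see above).
        List<String> sorted = new ArrayList<>(new TreeSet<>(sampleKeys));
        if (numRegions < 2 || sorted.size() < numRegions) {
            throw new IllegalArgumentException("need at least numRegions distinct sample keys");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Position of the i-th quantile in the sorted sample.
            int idx = (int) ((long) i * sorted.size() / numRegions);
            splits[i - 1] = sorted.get(idx).getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }
}
```

With 100 sampled keys and 10 regions this returns 9 split keys, each roughly 10 keys apart, and the result can be passed to createTable(tabledescriptor, splits). The split is only as good as the sample: if future data follows a different key distribution, the regions will still become uneven.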

Second question: My understanding is that you want to find the start and end rowkeys of the data already inserted in a specific table. Below are the ways.

To find the start and end rowkeys, scan the meta table (hbase:meta, or .META. in older versions) to see the start and end rowkey of each region.
You can also open the master UI at http://hbasemaster:60010 (port 16010 from HBase 1.0 onward) to see how the rowkeys are spread across the regions; the start and end rowkeys of each region are listed there.
To see how your keys are organized after pre-splitting the table and inserting into HBase, use FirstKeyOnlyFilter.

For example: scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()' displays all 100 of your rowkeys.

If you have a huge amount of data (not the 100 rows you mentioned) and want to dump all the rowkeys, you can run the following from outside the shell:

echo "scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'" | hbase shell > rowkeys.txt
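The dump still contains the shell's header, footer and cell columns. A small sketch for extracting just the rowkeys is shown below. It assumes that each result line has the form "<rowkey> column=..., timestamp=..., value=...", which is how the shell usually prints scan results but may differ between versions; the class ShellScanParser is a hypothetical name.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: pulls rowkeys out of `hbase shell` scan output.
class ShellScanParser {

    /** Returns the distinct rowkeys found in the given output lines, in sorted order. */
    static TreeSet<String> rowKeys(List<String> lines) {
        TreeSet<String> keys = new TreeSet<>();
        for (String line : lines) {
            // Assumption: result rows look like " <rowkey>   column=cf:q, timestamp=..., value=...".
            int idx = line.indexOf(" column=");
            if (idx <= 0) {
                continue; // header, footer, prompt or blank line
            }
            String key = line.substring(0, idx).trim();
            if (!key.isEmpty()) {
                keys.add(key);
            }
        }
        return keys;
    }
}
```

It could be called as ShellScanParser.rowKeys(Files.readAllLines(Paths.get("rowkeys.txt"))), using java.nio.file.Files and Paths. Non-printable bytes appear in the shell output as \xNN escapes and are kept as literal text by this sketch.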
