HBase Schema Design - Best Practice

Question

I recently switched from an RDBMS to HBase to handle millions of records, but as a newbie I am not sure how to design an efficient HBase schema. The scenario: I have text files containing hundreds, thousands, or millions of records that I have to read and store in HBase. There are two sets of text files (RawData file and Label file), which are linked to each other because they belong to the same user. For these files I have created two separate tables (RawData and Label) and store their contents there. The RawData file and RawData table look like this:

[Image: RawData file and RawData table]

As you can see, the row key in my RawData table is the name of the text file (01-01-All-Data.txt) combined with the row number of each row in that file. The column family is an arbitrary 'r', the column qualifiers are the columns of the text file, and the values are the column values. This is how I insert records into my table. I also have a third table (MapFile), where I store the name of the text file as the row key, the user id as the column qualifier, and the total number of records in the text file as the value. It looks like this:

            01-01-All-Data.txt       column=m:1, timestamp=1375189274467, value=146209

I will use the MapFile table to read the RawData table row by row.

What do you think of this kind of HBase schema? Is it a proper approach, or does it not make sense in terms of HBase concepts?

Furthermore, it is worth mentioning that inserting a 21 MB file with 146207 rows into HBase takes around 3 minutes.

Please advise.

Thanks

Tariq · Accepted Answer

Although I don't find anything wrong with your current schema, whether it is appropriate can only be decided after analyzing your use case and your most frequent access patterns. Correct is not always appropriate, IMHO. Since I don't know either of these, my suggestions may be off the mark. Please let me know if that is the case and I'll update the answer accordingly. Here we go:

Does it make sense (keeping your data and access pattern in mind) to have just one table with 3 column families:

RD - For RawData File which will have all the columns of this file
LF - For Label File with all the columns of this file, and
MF - For MapFile having one column holding number of records of your textfile.

Use the userid as the rowkey. It will be unique and doesn't look very long. With this design you avoid the overhead of jumping from one table to another while fetching the data.
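One caveat on the rowkey: HBase sorts rows by the raw bytes of the key, so if each text-file row stays a separate HBase row, the key needs a row-number part that sorts correctly. Below is a minimal sketch, assuming a key made of the userid plus a zero-padded row number. The `RowKeys` helper, the `-` separator and the width of 10 are illustrative assumptions, not something the schema above prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of the HBase API.
class RowKeys {
    // Assumed width: 10 digits is enough for any non-negative int row number.
    static final int ROW_NUMBER_WIDTH = 10;

    // Builds "<userId>-<zero-padded row number>", e.g. "1-0000000002".
    static String rowKey(String userId, int rowNumber) {
        if (rowNumber < 0) {
            throw new IllegalArgumentException("rowNumber must be >= 0");
        }
        return userId + "-" + String.format("%0" + ROW_NUMBER_WIDTH + "d", rowNumber);
    }

    // HBase row keys are byte arrays; this is the form a Put or Get would take.
    static byte[] rowKeyBytes(String userId, int rowNumber) {
        return rowKey(userId, rowNumber).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Without padding, "1-10" would sort before "1-2".
        List<String> keys = new ArrayList<>();
        for (int n : new int[] {10, 2, 1}) {
            keys.add(rowKey("1", n));
        }
        Collections.sort(keys); // same order as byte-wise order for ASCII keys
        System.out.println(keys);
    }
}
```

With keys like these, a scan over one user's prefix should return the rows in file order, which may make the stored record count unnecessary for row-by-row reads.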

A few more suggestions:

If userids are monotonically increasing, hash your rowkeys so that you don't suffer from RegionServer hotspotting.
You could also create pre-split tables in order to get better distribution.
Shorten the column names if possible.
Keep the number of versions as low as possible.
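The first two suggestions can work together. Below is a rough sketch, assuming 16 buckets and a two-digit bucket prefix derived from an MD5 hash of the userid. The `SaltedKeys` class, the bucket count and the key layout are assumptions for illustration. The split points it produces are the kind of `byte[][]` you would hand to the admin API's `createTable` overload that accepts split keys.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the HBase API.
class SaltedKeys {
    // Assumed bucket count; pick it to match the number of regions you want.
    static final int BUCKETS = 16;

    // Maps a userid to a stable bucket in [0, BUCKETS) using the first MD5 byte.
    static int bucket(String userId) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(userId.getBytes(StandardCharsets.UTF_8));
            return (digest[0] & 0xFF) % BUCKETS;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Prefixes the userid with its bucket, e.g. "07-42", so sequential ids scatter.
    static String saltedKey(String userId) {
        return String.format("%02d", bucket(userId)) + "-" + userId;
    }

    // Split points "01" .. "15": 15 boundaries give 16 regions, one per bucket.
    static byte[][] splitPoints() {
        byte[][] splits = new byte[BUCKETS - 1][];
        for (int b = 1; b < BUCKETS; b++) {
            splits[b - 1] = String.format("%02d", b).getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    public static void main(String[] args) {
        for (String userId : new String[] {"1", "2", "3"}) {
            System.out.println(userId + " -> " + saltedKey(userId));
        }
        System.out.println(splitPoints().length + " split points");
    }
}
```

The prefix is deterministic, so a lookup by userid can recompute it. The trade-off is that a range scan across users has to be issued once per bucket.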

Furthermore, it is worth mentioning that inserting a 21 MB file with 146207 rows into HBase takes around 3 minutes.

How are you inserting your data? MapReduce or the normal Java + HBase API? What is your cluster size? Configuration and specs?
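To put the reported numbers in perspective, here is a quick back-of-the-envelope calculation, taking "around 3 mins" as exactly 180 seconds (an assumption). The `InsertThroughput` class is only illustrative.

```java
// Illustrative arithmetic on the figures from the question.
class InsertThroughput {
    // Rows written per second of wall-clock time.
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    // Average wall-clock milliseconds spent per row.
    static double millisPerRow(long rows, double seconds) {
        return seconds * 1000.0 / rows;
    }

    public static void main(String[] args) {
        long rows = 146207;       // row count reported in the question
        double seconds = 3 * 60;  // "around 3 mins", assumed to be exactly 180 s
        System.out.printf("%.0f rows/s, %.2f ms/row%n",
                rowsPerSecond(rows, seconds), millisPerRow(rows, seconds));
    }
}
```

That works out to roughly 800 rows per second, or a bit over a millisecond per row. A per-row cost in that range would be consistent with each row being sent to the server on its own rather than in batches, but that is only a guess until the insert code and cluster details are known.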

You might find these links useful :

HTH
