Reputation: 673
I recently switched from an RDBMS to HBase to handle millions of records, but as a newbie I am not sure what an efficient HBase schema design looks like. My scenario: I have text files containing hundreds, thousands, or millions of records that I have to read and store in HBase. There are two sets of text files (RawData files and Label files), which are linked to each other because they belong to the same user. For these files I have created two separate tables (RawData and Label) where I store their contents. The RawData file and RawData table look like this:
As you can see, in my RawData table the row key is the name of the text file (01-01-All-Data.txt) combined with the row number of each row in that file. The column family is just an arbitrary 'r', the column qualifiers are the columns of the text file, and the values are the column values. This is how I insert records into the table. I also have a third table (MapFile), where I store the name of the text file as the row key, the user's id as the column qualifier, and the total number of records in the text file as the value. It looks like this:
01-01-All-Data.txt column=m:1, timestamp=1375189274467, value=146209
I will use the MapFile table to read the RawData table row by row.
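To make the row key layout concrete, here is a minimal sketch of how a "file name plus row number" key could be built. The separator, the padding width, and the helper name `make_row_key` are assumptions for illustration, not necessarily the exact format in the table. HBase keeps rows sorted lexicographically by row key, so the sketch zero-pads the row number; without padding, row 10 would sort before row 2.

```python
def make_row_key(file_name: str, row_number: int, width: int = 7) -> str:
    """Build a composite key: file name, a dash, then a zero-padded row number.

    The dash separator and the default width of 7 are hypothetical choices.
    """
    return f"{file_name}-{row_number:0{width}d}"


FILE_NAME = "01-01-All-Data.txt"

# Unpadded row numbers: string order differs from numeric order.
unpadded = sorted(f"{FILE_NAME}-{n}" for n in (1, 2, 10))

# Zero-padded row numbers: string order matches numeric order.
padded = sorted(make_row_key(FILE_NAME, n) for n in (1, 2, 10))

print(unpadded)
print(padded)
```

With padded keys, reading a file "row by row" could be done with a single range scan over the file's key prefix instead of one lookup per row number.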
What do you think of this kind of HBase schema? Is it a proper approach, or does it not make sense in terms of HBase concepts?
Furthermore, it is worth mentioning that inserting a 21 MB file with 146207 rows into HBase takes around 3 minutes.
Please advise.
Thanks
Upvotes: 2
Views: 4350
Reputation: 34184
Although I don't see anything wrong with your current schema, whether it is appropriate can be decided only after analyzing your use case and your most frequent access patterns. Correct is not always appropriate, IMHO. Since I don't know any of this, my suggestions may be off the mark. Please let me know if that is the case and I'll update the answer accordingly. Here we go:
Does it make sense (keeping your data and access pattern in mind) to have just one table with 3 column families:
Use the userid as the row key. It will be unique and doesn't look very long. With this design you avoid the overhead of jumping from one table to another while fetching the data.
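As a rough illustration of the single-table idea, here is a small in-memory stand-in (not an HBase client) with one row per user id and three column families. The family names are assumptions: 'r' and 'm' appear in your question, 'l' is a guess for the label data. The qualifier format and the sample raw value are hypothetical placeholders too.

```python
from collections import defaultdict

# Assumed family names: raw data, labels, and the file-to-count map.
FAMILIES = ("r", "l", "m")


class SingleTable:
    """Toy model of one table keyed by user id, with three column families."""

    def __init__(self):
        # row key -> family -> qualifier -> value
        self.rows = defaultdict(lambda: {family: {} for family in FAMILIES})

    def put(self, user_id, family, qualifier, value):
        if family not in FAMILIES:
            raise KeyError(f"unknown column family: {family}")
        self.rows[user_id][family][qualifier] = value

    def get(self, user_id):
        # Returns every family for the user in one lookup, or None if absent.
        return self.rows.get(user_id)


table = SingleTable()
table.put("1", "m", "01-01-All-Data.txt", 146209)           # record count from the question
table.put("1", "r", "01-01-All-Data.txt:1:col1", "sample")  # placeholder raw value
row = table.get("1")
print(row["m"], row["r"])
```

The point of the sketch is only that a single lookup by user id returns the raw, label, and map data together. Whether that fits depends on how much data each user has and how you read it.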
A few more suggestions:
You mention that inserting a 21 MB file with 146207 rows into HBase takes around 3 minutes.
How are you inserting your data: with MapReduce or with the plain Java + HBase API? What is your cluster size? What are its configuration and specs?
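If it turns out you are sending one put per row through the plain client API, grouping rows into batches is one thing worth trying. The following is only a sketch of the batching logic: `send_batch` is a hypothetical placeholder for whatever client call you use (in the Java client, for example, `put` has an overload that takes a `List<Put>`), and the batch size of 1000 is an arbitrary assumption to tune.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load(rows, send_batch, batch_size=1000):
    """Send rows in batches via send_batch and return how many rows were sent.

    send_batch is a hypothetical callable standing in for the real client call.
    """
    sent = 0
    for batch in batched(rows, batch_size):
        send_batch(batch)
        sent += len(batch)
    return sent


# Usage with a fake sender that just records the size of each batch.
batch_sizes = []
total = load(range(2500), lambda batch: batch_sizes.append(len(batch)))
print(total, batch_sizes)
```

This cuts the number of round trips from one per row to one per batch. How much it helps depends on the cluster details asked about above.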
You might find these links useful:
HTH
Upvotes: 6