Wasim Karani

Reputation: 8886

HBase RowKey schema design

I am using HBase to store web table content, much as Google does with Bigtable.
For reference, see the Google Bigtable paper.
My question is about the RowKey and how it should be formed.
Google stores the URL with the host name reversed, as the paper shows with "com.cnn.www", so that all the links associated with cnn.com are managed in the same block of GFS, which makes them a lot easier to scan.
I could do the same thing Google does, but wouldn't it be better to use some algorithm to compress the URL?
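The reversal described above can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_host_key` is made up here, and the input is assumed to be a scheme-less URL such as `www.cnn.com/index.php`.

```python
def reverse_host_key(url):
    """Reverse the dot-separated host labels and leave the path untouched.

    Assumes the URL has no scheme (no 'http://') and that the first '/'
    separates the host from the path.
    """
    host, slash, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + slash + path


print(reverse_host_key("www.cnn.com/index.php"))
```

Because the most significant part of the domain comes first, keys for the same site sort next to each other.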

For example:

URL                                  |  Google Bigtable                      |  Algorithm output
www.cnn.com/index.php                |  com.cnn.www/index.php                |  12as/435
www.cnn.com/news/business/index.html |  com.cnn.www/news/business/index.html |  12as/2as/dcx/asd
www.cnn.com/news/sports/index.html   |  com.cnn.www/news/sports/index.html   |  12as/2as/eds/scf

The reason for doing this is that the row key will be shorter, as recommended by the HBase schema design guide (section 6.3.2.3, Rowkey Length).

So what I need to know is whether I am correct here.
Also, if I am correct, which algorithm should I use? I am using Python over Thrift, so code will be overwhelming for me...

Upvotes: 0

Views: 615

Answers (1)

Arnon Rotem-Gal-Oz

Reputation: 25919

When you shorten the URI, do it separately for the host and for the path and then concatenate the two, so your key would be something like hostHash!pathHash. This keeps the key short on the one hand and groups all the URIs from the same site together on the other.
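A minimal Python sketch of the hostHash!pathHash idea follows. The answer does not name a hash function, so the truncated SHA-256 used here, the 8-character length, and the helper names `short_hash` and `make_row_key` are all assumptions made for illustration.

```python
import hashlib


def short_hash(text, length=8):
    """Return the first `length` hex characters of a SHA-256 digest.

    The hash function and the length are illustrative choices; a shorter
    prefix saves space but makes collisions more likely.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def make_row_key(url, sep="!"):
    """Build a 'hostHash!pathHash' key from a scheme-less URL."""
    host, _, path = url.partition("/")
    return short_hash(host) + sep + short_hash("/" + path)


for url in ("www.cnn.com/news/business/index.html",
            "www.cnn.com/news/sports/index.html"):
    print(make_row_key(url))
```

Both keys printed above share the same prefix before the `!`, so rows for one host stay adjacent. A hash cannot be reversed, so the original URL would still need to be stored in a column, and truncated hashes can collide, which is worth checking against the expected number of URLs.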

Upvotes: 1
