Reputation: 8886
I am using HBase to store webtable content, much as Google does with Bigtable.
For reference, see the Google Bigtable paper.
My question is about the row key and how it should be formed.
Google stores the URL with the host name reversed, as you can see in the PDF document ("com.cnn.www"), so that all the links associated with cnn.com are managed in the same block of GFS, which makes them a lot easier to scan.
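To make the reversed-host idea concrete, here is a minimal sketch in Python. The function name `reversed_host_key` is hypothetical, and the sketch assumes the query string and port can be ignored; it only illustrates the key layout and is not how Bigtable itself builds keys.

```python
from urllib.parse import urlsplit

def reversed_host_key(url):
    """Build a Bigtable-style row key: reversed host name followed by the path."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present.
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    # "www.cnn.com" -> "com.cnn.www" (assumption: port and query string are dropped)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path

print(reversed_host_key("www.cnn.com/news/sports/index.html"))
```

Because keys are sorted lexicographically, every page under the same domain ends up next to each other, including subdomains of that domain.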
I could do the same thing Google does, but wouldn't it be better to use some algorithm to compress the URL?
For example:
URL | Google Bigtable row key | Algorithm output
www.cnn.com/index.php | com.cnn.www/index.php | 12as/435
www.cnn.com/news/business/index.html | com.cnn.www/news/business/index.html | 12as/2as/dcx/asd
www.cnn.com/news/sports/index.html | com.cnn.www/news/sports/index.html | 12as/2as/eds/scf
The reason for doing this is that the row key will be shorter, as recommended by the HBase schema design guide (see topic 6.3.2.3, Rowkey Length).
So what I need from you is to know whether I am correct here.
Also, if I am correct, which algorithm should I be using? I am using Python over Thrift as my programming language, so code will be overwhelming for me...
Upvotes: 0
Views: 615
Reputation: 25919
When you shorten the URI, do it separately for the host and for the path, then concatenate the two, so your key would be something like hostHash!pathHash. That keeps the key short on the one hand and groups all the URIs from the same site together on the other.
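A minimal sketch of the hostHash!pathHash layout in Python follows. The helper names `short_hash` and `make_row_key`, the choice of SHA-1, and the 8-character truncation are all assumptions made for illustration; any stable hash would do.

```python
import hashlib
from urllib.parse import urlsplit

def short_hash(text, length=8):
    """Return the first `length` hex characters of the SHA-1 digest (assumed scheme)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:length]

def make_row_key(url):
    """Build a key of the form hostHash!pathHash."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present.
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    return short_hash(host) + "!" + short_hash(path)

# Both keys share the same prefix because the host is identical.
print(make_row_key("www.cnn.com/index.php"))
print(make_row_key("www.cnn.com/news/sports/index.html"))
```

A few caveats with this approach: a hash cannot be reversed, so the original URL should also be stored in a column; truncated hashes can collide, so the length is a trade-off between key size and collision risk; and, unlike reversed host names, hashed hosts do not keep subdomains of the same domain adjacent, nor do hashed paths preserve the ordering of the site's directory structure.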
Upvotes: 1