Reputation: 5156
While researching column-oriented databases, I read "the primary key is the data" many times (e.g., at Column-oriented DBMS).
I thought I could randomly access any cell (in a certain column) by value, because the values, the data, are already indexed as the primary key.
But after I put more than 3M rows into HBase, the HBase shell command
scan 'lottery', {COLUMNS => 'cf:status', FILTER => "ValueFilter(=, 'binary:win')"}
takes more than 3 seconds...
(It's getting slower and slower as more and more rows are put...)
'win' or 'lose' are the two possible values for the column cf:status, and there is only 1 row whose value is 'win'.
I might have misunderstood...
What does "the primary key is the data" mean in column-oriented DB?
Thank you.
Upvotes: 3
Views: 671
Reputation: 51
To be able to find something quickly with HBase, it needs to be a prefix of the rowkey. Therefore rowkey design is of great importance when building for speed. For your case, you could put the value 'lottery_win' or 'lottery_lose' at the beginning of the rowkey of every row. A scan restricted to the rowkey prefix 'lottery_win' would then be very fast (sub-second), even with hundreds of billions of rows.
Filters in HBase are usually not very fast, as the filter looks at every row that matches your scan. Having a filter read through millions of rows is generally not a good idea if you want speed.
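To make the difference concrete, here is a rough Python sketch of the idea (not HBase code). It assumes a toy in-memory "table" kept sorted by rowkey, which is how HBase orders rows, and a hypothetical rowkey layout of 'lottery_<status>_<ticket id>'. The names rows, keys, prefix_scan and value_filter_scan are made up for illustration. It counts how many rows each approach has to examine.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: rows sorted by rowkey.
# The rowkey layout "lottery_<status>_<id>" is a hypothetical design.
rows = sorted(
    [(f"lottery_lose_{i:07d}", {"cf:status": "lose"}) for i in range(100_000)]
    + [("lottery_win_0042424", {"cf:status": "win"})],
    key=lambda row: row[0],
)
keys = [key for key, _ in rows]


def prefix_scan(prefix):
    """Seek to the first rowkey >= prefix, then read while the prefix matches."""
    i = bisect_left(keys, prefix)  # binary search, no full pass over the keys
    matched, examined = [], 0
    while i < len(keys):
        examined += 1
        if not keys[i].startswith(prefix):
            break  # keys are sorted, so nothing later can match
        matched.append(keys[i])
        i += 1
    return matched, examined


def value_filter_scan(column, value):
    """Mimic a value filter: every row is read and its cell value compared."""
    matched, examined = [], 0
    for key, cells in rows:
        examined += 1
        if cells.get(column) == value:
            matched.append(key)
    return matched, examined


if __name__ == "__main__":
    print(prefix_scan("lottery_win"))
    print(value_filter_scan("cf:status", "win")[1], "rows examined by the filter")
```

In this toy model both calls find the same single row, but the prefix scan only touches the rows in the matching key range while the value filter touches all of them, which is why the filter version keeps getting slower as the table grows.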
A primary key in a DBMS doesn't by itself imply anything about performance. It is a constraint on the records you can put into a table. What gives the speed is an index. An HBase table only has one indexed item, and that is the rowkey. No other columns are indexed, and therefore filters are slow (on the order of millions of rows per second).
Upvotes: 1