ghchoi

Reputation: 5156

What's the Meaning of "the primary key is the data" in Columnar DB

While researching column-oriented DB, I read "the primary key is the data" many times. (e.g., at Column-oriented DBMS)

I thought I could randomly access any cell (in a certain column) by its value, because the values, i.e. the data, are already indexed as the primary key.

But after I put more than 3M rows into HBase, the HBase shell command

scan 'lottery', {COLUMNS => 'cf:status', FILTER => "ValueFilter(=, 'binary:win')"}

takes more than 3 seconds...

(It keeps getting slower as more rows are put...)

'win' and 'lose' are the only two possible values for the column cf:status, and there is only 1 row whose value is 'win'.

I might have misunderstood...

What does "the primary key is the data" mean in column-oriented DB?

Thank you.

Upvotes: 3

Views: 671

Answers (1)

ThoG

Reputation: 51

To be able to find something quickly with HBase, it needs to be a prefix of the rowkey. Therefore rowkey design is of great importance when building for speed. For your case, you could put the value 'lottery_win' or 'lottery_lose' at the beginning of the rowkey of every row. This would make a scan for the rowkey prefix 'lottery_win' very fast (sub-second), even with hundreds of billions of rows.
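To illustrate why a rowkey prefix is cheap to look up, here is a minimal Python sketch of a sorted key store. It is a toy model and not HBase's API: the function name `prefix_scan` and the rowkey layout (status first, then a made-up ticket id) are assumptions for illustration only. Because the keys are kept sorted, a binary search jumps straight to the first matching key and the scan stops at the first key that no longer matches.

```python
import bisect

def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with `prefix`, reading only the matching range."""
    # Binary search for the first key >= prefix.
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    # Keys are sorted, so no key after the first mismatch can match.
    return matches

# Hypothetical rowkeys: the status comes first, then a ticket id.
rowkeys = sorted(
    ["lottery_lose_%07d" % n for n in range(100_000)] + ["lottery_win_0042424"]
)

print(prefix_scan(rowkeys, "lottery_win"))  # ['lottery_win_0042424']
```

The lookup touches one matching key plus a handful of binary-search probes, regardless of how many 'lottery_lose' keys exist.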

Filters in HBase are usually not very fast, as the filter looks at every row that matches your scan. Having a filter read through millions of rows is generally not a good idea if you want speed.
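As a rough illustration of that cost, the following Python sketch models a value filter over an unindexed column. It is a simplified model rather than HBase code: `filter_scan`, the ticket ids, and the row layout are made up for the example. The point is that the number of rows examined equals the table size, even when only one row matches.

```python
def filter_scan(rows, wanted):
    """Check every row's value, as a filter on an unindexed column has to."""
    examined = 0
    matches = []
    for rowkey, status in rows:
        examined += 1
        if status == wanted:
            matches.append(rowkey)
    return matches, examined

# Hypothetical table: 100,000 rows, exactly one of them with status 'win'.
rows = [("ticket_%07d" % n, "lose") for n in range(100_000)]
rows[4242] = ("ticket_0004242", "win")

matches, examined = filter_scan(rows, "win")
print(matches, examined)  # ['ticket_0004242'] 100000
```

The work grows linearly with the row count, which matches the slowdown described in the question as more rows are added.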

A primary key in a DBMS doesn't by itself imply anything about performance. It is a constraint on the records you can put into a table. What gives the speed is an index. An HBase table has only one indexed item, and that is the rowkey. No other columns are indexed, and therefore filters are slow (they process on the order of millions of rows per second).

Upvotes: 1
