Reputation: 597
I understand that the shell command count gives the number of rows in a table. But what do INTERVAL and CACHE denote here? I checked the web, and almost all the websites give the same explanation:
"Current count is shown every 1000 rows by default. Count interval may be optionally specified. Scan caching is enabled on count scans by default. Default cache size is 10 rows. If your rows are small in size, you may want to increase this parameter. Examples:"
I do not understand what this is explaining.
hbase> count 't1', INTERVAL => 100000
hbase> count 't1', CACHE => 1000
hbase> count 't1', INTERVAL => 10, CACHE => 1000
Can anybody explain this in a simple way?
Upvotes: 4
Views: 6209
Reputation: 34704
@MallowFox explained COUNT well.
Caching, however, is a bit more confusing. Why would caching make counting faster? The count doesn't need to remember the rows it has counted. All that matters is how many rows there are, not their content.
It turns out "caching" is a bit of a misnomer; it would more appropriately be named buffer size or batch size. It's the number of rows that come back for each RPC to HBase. If the number is too low, the count needs many more round trips, the per-call overhead adds up, and the count can become much slower.
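To make the round-trip effect concrete, here is a rough back-of-the-envelope sketch in Python. It is not HBase code: the function name `rpc_round_trips` and the table size of one million rows are made up for illustration, and it assumes every RPC returns a full batch of CACHE rows. A real scanner may return fewer rows per call because of other limits, such as a maximum result size.

```python
def rpc_round_trips(total_rows: int, cache: int) -> int:
    """Estimate how many scanner RPCs are needed to stream `total_rows`
    rows when each RPC returns at most `cache` rows (hypothetical model)."""
    if cache <= 0:
        raise ValueError("cache must be positive")
    # Ceiling division using integers only.
    return (total_rows + cache - 1) // cache


# Hypothetical table with one million rows.
for cache in (10, 1000):
    print(f"CACHE => {cache}: about {rpc_round_trips(1_000_000, cache)} round trips")
```

Under these assumptions, going from CACHE => 10 to CACHE => 1000 cuts the number of round trips by a factor of 100, which is why the count finishes faster even though nothing is "cached" in the usual sense.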
More about this here:
https://stackoverflow.com/a/22547731/492773
Upvotes: 2
Reputation: 3305
You can just run the count command on a large table (more than 2000 rows) and see how they work.
Because the count operation may take a LONG time, it keeps printing the current result as it goes, like this:
Current count: 1000, row: ...
Current count: 2000, row: .....
Current count: 3000, row: ....
So if INTERVAL is 1000, it prints a progress line each time the count reaches another 1000 rows.
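The behaviour described above can be imitated with a small Python simulation. This is only a sketch of the idea, not the shell's actual implementation: `count_rows` and the generated row keys are hypothetical, and a real count would read the keys from a table scan.

```python
def count_rows(row_keys, interval=1000):
    """Count rows and collect a progress line every `interval` rows."""
    progress = []
    count = 0
    for key in row_keys:
        count += 1
        # Report progress only when the running count hits a multiple of interval.
        if count % interval == 0:
            progress.append(f"Current count: {count}, row: {key}")
    return count, progress


# Hypothetical table of 3500 rows with made-up keys.
keys = [f"row{i:05d}" for i in range(1, 3501)]
total, progress = count_rows(keys, interval=1000)
print("\n".join(progress))
print(f"{total} row(s)")
```

With 3500 rows and an interval of 1000, this prints three progress lines (at 1000, 2000 and 3000) followed by the final total. A larger INTERVAL simply means fewer progress lines; it does not change the result.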
And CACHE is just the cache setting of the underlying scan command. Basically, the count will be faster if you increase the cache setting, but it will use more memory, which is why the help text says:
If your rows are small in size, you may want to increase this parameter.
Upvotes: 3