Jack

Reputation: 5870

Why is the HBase count operation so slow?

The command is:

count 'tableName'

It is very slow to get the total number of rows in the whole table.

My situation is:

I'm very curious why HBase is so slow at this operation; I guess it's even slower than MySQL. Is Cassandra faster than HBase at these operations?

Upvotes: 5

Views: 2716

Answers (2)

Rubén Moraleda

Reputation: 3067

First of all, keep in mind that to benefit from data locality, your "slaves" (better known as RegionServers) must also have the DataNode role. Not doing so is a performance killer.

For performance reasons, HBase does not maintain a live counter of rows. To perform a count, the HBase shell client has to retrieve all the data. That means that if your average row holds 5M of data, the client would pull 5M * 1550 from the RegionServers just to count, which is a lot.
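To put a number on that, here is a back-of-the-envelope sketch in Java. It assumes "5M" means 5 MiB per row and takes the 1550 rows from the question; the class and method names are made up for illustration.

```java
public class CountTransferEstimate {
    // Total bytes the client would pull if every row is read in full.
    static long bytesTransferred(long rowCount, long avgRowBytes) {
        return Math.multiplyExact(rowCount, avgRowBytes);
    }

    public static void main(String[] args) {
        long rows = 1550;                     // row count mentioned in the question
        long avgRowBytes = 5L * 1024 * 1024;  // assumption: "5M" read as 5 MiB
        long total = bytesTransferred(rows, avgRowBytes);
        double gib = total / (1024.0 * 1024 * 1024);
        System.out.println(total + " bytes, about " + gib + " GiB");
    }
}
```

Under those assumptions the client moves roughly 7.5 GiB over the network only to end up with a single number.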

To speed it up you have 2 options:

  • If you need real-time responses, you can maintain your own live counter of rows using HBase atomic counters: each time you insert a row you increment the counter, and each time you delete one you decrement it. The counter can even live in the same table; just use another column family to store it.

  • If you don't need real-time results, run a distributed row counter MapReduce job (source code), forcing the scan to use only the smallest column family and column available so that it avoids reading big rows. Each RegionServer will read its locally stored data and no network I/O will be required. In this case you may need to add a new column with a small value to your rows if you don't have one yet (a boolean is your best option).
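The bookkeeping behind the first option can be sketched in plain Java. This is only an in-memory model, not HBase client code: the map stands in for the table, the AtomicLong stands in for an HBase atomic counter cell, and every name in it is hypothetical. With the real Java client, the increment would typically go through Table.incrementColumnValue.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LiveRowCounterSketch {
    // Stand-in for the table: row key -> value. Not an HBase API.
    private final ConcurrentHashMap<String, String> rows = new ConcurrentHashMap<>();
    // Stand-in for an HBase atomic counter cell.
    private final AtomicLong rowCount = new AtomicLong();

    // Insert or overwrite a row; bump the counter only when the key is new.
    public void put(String rowKey, String value) {
        if (rows.put(rowKey, value) == null) {
            rowCount.incrementAndGet();
        }
    }

    // Delete a row; decrement only when the row actually existed.
    public void delete(String rowKey) {
        if (rows.remove(rowKey) != null) {
            rowCount.decrementAndGet();
        }
    }

    // Reading the count is a single lookup, no scan involved.
    public long count() {
        return rowCount.get();
    }

    public static void main(String[] args) {
        LiveRowCounterSketch table = new LiveRowCounterSketch();
        table.put("row1", "a");
        table.put("row2", "b");
        table.put("row1", "c");   // overwrite, count stays at 2
        table.delete("row2");
        System.out.println(table.count()); // prints 1
    }
}
```

One thing the model makes visible: a put on an existing row key is an overwrite, so incrementing blindly on every write would overcount. In HBase itself that existence check is the application's job.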

Upvotes: 3

Anil Gupta

Reputation: 1126

First of all, you have a very small amount of data. At that kind of volume, IMO, using NoSQL gives you no advantage. Your test is not an appropriate way to judge the performance of HBase and Cassandra. Both have their own use cases and sweet spots.

The count command in the HBase shell runs a single-threaded Java program to count the rows. Still, I am surprised that it's taking 2 minutes to count 1550 rows. If you would like to count faster (for bigger datasets), you should run the HBase RowCounter MapReduce job.
Run the MapReduce job like this:

bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter tableName

Upvotes: 7
