Fast estimated count of rows in Cassandra table

Question

I am surprised that this question wasn't raised before.

Suppose we have a huge table in Cassandra and need an estimated number of rows in it (an approximation, not an exact count).

Apparently a simple select count(*) from table is inefficient and can take a long time. We need something quick and dirty.

The DataStax blog suggests the following:

I don’t care about the exact number, can I have a ballpark estimate?

Because Cassandra knows how many rows there are in each SSTable it is possible to get an estimate. The ‘nodetool cfstats’ output tells you these counts in the ‘Number of Keys (estimate)’ line. This is the sum of rows in each SStable (again approximate due to the indexing used but can’t be off by more than 128 by default).

My question: can we perform the same operation using the DataStax Enterprise Java driver?

P.S. I cannot change the table structure. Assume I am using a legacy schema; in other words, I am not interested in workarounds such as adding a counter or other special fields.

dilsingi · Accepted Answer

Cassandra also exposes the approximate count (the one reported by "nodetool cfstats") via JMX. Your code can hook into this JMX metric to get the count programmatically.

Name: EstimatedPartitionCount | Type: Gauge | Description: Approximate number of keys in table.

    {
      "type": "READ",
      "mbean": "org.apache.cassandra.metrics:type=Table,keyspace=*,scope=*,name=EstimatedPartitionCount",
      "attribute": "Value"
    }

The Cassandra metrics documentation lists all of the JMX metrics exposed.
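
Since the question asks for Java, here is a minimal sketch of reading that gauge with the JDK's standard JMX client classes rather than through the driver. It assumes JMX is reachable on Cassandra's default port 7199 without authentication or SSL, and that the node uses the type=Table metric naming shown above. The class name PartitionCountEstimate and the helper methods are made up for illustration.

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper class, not part of Cassandra or the DataStax driver.
public class PartitionCountEstimate {

    // Builds the MBean name of the EstimatedPartitionCount gauge for one table.
    static ObjectName metricName(String keyspace, String table)
            throws MalformedObjectNameException {
        return new ObjectName("org.apache.cassandra.metrics:type=Table"
                + ",keyspace=" + keyspace
                + ",scope=" + table
                + ",name=EstimatedPartitionCount");
    }

    // Reads the gauge; a gauge exposes its current reading as the "Value" attribute.
    static long readEstimate(MBeanServerConnection conn, String keyspace, String table)
            throws Exception {
        Object value = conn.getAttribute(metricName(keyspace, table), "Value");
        return ((Number) value).longValue();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: PartitionCountEstimate <host> <keyspace> <table>");
            return;
        }
        // Assumption: default JMX port 7199, no credentials or SSL configured.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            long estimate = readEstimate(
                    connector.getMBeanServerConnection(), args[1], args[2]);
            System.out.println(estimate);
        }
    }
}
```

Keep in mind that the gauge is reported per node and counts partitions rather than CQL rows. For a cluster-wide figure you would likely need to query every node and account for the replication factor, and for tables with clustering columns the row count can be much higher than the partition count.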
