How to obtain row count estimates in Cassandra using the Java client driver

Question

If the only thing I have available is a com.datastax.driver.core.Session, is there a way to get a rough estimate of the row count in a Cassandra table on a remote server? Performing a count is too expensive. I understand I can get a partition count estimate through JMX, but I'd rather not assume JMX has been configured. (I think that result must be multiplied by the number of nodes and divided by the replication factor.) Ideally the estimate would include clustering keys too, but everything is on the table.

I also see there's a size_estimates table in the system keyspace, but I don't see much documentation on it. Is it periodically refreshed, or do the admins need to run something like nodetool flush?

Aside from not including clustering keys, what's wrong with using this as a very rough estimate?

select sum(partitions_count)
from system.size_estimates
where keyspace_name='keyspace' and table_name='table';

Chris Lohfink · Accepted Answer

The size_estimates table is updated on a timer every 5 minutes (overridable with -Dcassandra.size_recorder_interval).

This is a very rough estimate, but from the token of the partition key you could find the range it belongs in, pull from this table on each of the replicas (it uses local replication and is unique to each node, not global), and divide out the size and the number of partitions for a very vague, approximate estimate of the partition size. A lot of assumptions and averaging go into this path even before anything is written to this table. Cassandra errs on the side of efficiency at the cost of accuracy here, and the table is meant more for general uses like Spark bulk reading, so take it with a grain of salt.
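As a rough illustration of the lookup described above, here is a minimal sketch in plain Java. It assumes the rows of system.size_estimates for one table have already been read from a single node (for example via Session.execute) and that the cluster uses Murmur3Partitioner, so the text columns range_start and range_end parse as long values. The class name SizeEstimateSketch, the Estimate holder, and the helper methods are hypothetical names made up for this example, and treating each range as start-exclusive, end-inclusive with wrap-around is an assumption about how the rows should be read, not something the table documents.

```java
import java.util.List;

/** Sketch only: operates on values already read from system.size_estimates on one node. */
final class SizeEstimateSketch {

    /** One size_estimates row, with Murmur3 tokens already parsed to long (assumption). */
    static final class Estimate {
        final long rangeStart;        // range_start, treated as exclusive
        final long rangeEnd;          // range_end, treated as inclusive
        final long partitionsCount;   // partitions_count
        final long meanPartitionSize; // mean_partition_size, in bytes

        Estimate(long rangeStart, long rangeEnd, long partitionsCount, long meanPartitionSize) {
            this.rangeStart = rangeStart;
            this.rangeEnd = rangeEnd;
            this.partitionsCount = partitionsCount;
            this.meanPartitionSize = meanPartitionSize;
        }

        /** True if the token falls in (rangeStart, rangeEnd], allowing for wrap-around. */
        boolean contains(long token) {
            if (rangeStart < rangeEnd) {
                return token > rangeStart && token <= rangeEnd;
            }
            // Wrapping range: everything after the start or up to the end.
            return token > rangeStart || token <= rangeEnd;
        }
    }

    /** Returns the first row whose range holds the token, or null if none does. */
    static Estimate findRange(List<Estimate> rows, long token) {
        for (Estimate row : rows) {
            if (row.contains(token)) {
                return row;
            }
        }
        return null;
    }

    /** Sum of partitions_count over the rows one node reports for a table. */
    static long totalPartitions(List<Estimate> rows) {
        long total = 0;
        for (Estimate row : rows) {
            total += row.partitionsCount;
        }
        return total;
    }

    /** Rough byte total: sum of partitions_count * mean_partition_size per row. */
    static long totalBytes(List<Estimate> rows) {
        long total = 0;
        for (Estimate row : rows) {
            total += row.partitionsCount * row.meanPartitionSize;
        }
        return total;
    }
}
```

Because the table is local to each node, these helpers would have to be run against the rows from each replica separately, and the results inherit all the averaging already baked into the stored values.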

It's not useful now, but looking towards the future, after the 4.0 freeze there will be many new virtual tables, possibly including ones that provide accurate statistics on specific partitions and ranges of partitions on demand.
