bbarker

Reputation: 13108

How does one figure out the approximate number of keys in a table in Cassandra?

I've seen references to a "Number of keys (estimate)" value in the output of nodetool cfstats, but at least on my system (Cassandra version 3.11.3), I don't see it:

           Table: XXXXXX
            SSTable count: 4
            Space used (live): 2393755943
            Space used (total): 2393755943
            Space used by snapshots (total): 0
            Off heap memory used (total): 2529880
            SSTable Compression Ratio: 0.11501749368144083
            Number of partitions (estimate): 1146
            Memtable cell count: 296777
            Memtable data size: 147223380
            Memtable off heap memory used: 0
            Memtable switch count: 127
            Local read count: 9
            Local read latency: NaN ms
            Local write count: 44951572
            Local write latency: 0.043 ms
            Pending flushes: 0
            Percent repaired: 0.0
            Bloom filter false positives: 0
            Bloom filter false ratio: 0.00000
            Bloom filter space used: 2144
            Bloom filter off heap memory used: 2112
            Index summary off heap memory used: 240
            Compression metadata off heap memory used: 2527528
            Compacted partition minimum bytes: 447
            Compacted partition maximum bytes: 43388628
            Compacted partition mean bytes: 13547448
            Average live cells per slice (last five minutes): NaN
            Maximum live cells per slice (last five minutes): 0
            Average tombstones per slice (last five minutes): NaN
            Maximum tombstones per slice (last five minutes): 0
            Dropped Mutations: 0

Is there some way to approximate select count(*) from XXXXXX with this version of Cassandra?

Upvotes: 3

Views: 451

Answers (2)

Aaron

Reputation: 57798

This was changed with CASSANDRA-13722. The "number of keys" estimate always meant "number of partitions" anyway; the rename just makes that apparent.

To approximate the number of rows in a large table, you could take that value (number of partitions) as a starting point. Then estimate the average number of clustering key combinations (rows) per partition, multiply the two, and you should have an educated guess at it.
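As a rough sketch of that multiplication (not anything built into Cassandra): the function name and the sampled per-partition row counts below are hypothetical, and only the 1146 figure comes from the nodetool output in the question.

```python
from statistics import mean


def estimate_rows(partition_estimate, sampled_rows_per_partition):
    """Estimate rows on one node: partitions x average rows per partition."""
    avg_rows = mean(sampled_rows_per_partition)
    return round(partition_estimate * avg_rows)


# 1146 is "Number of partitions (estimate)" from the question's output.
# The sample counts are made up; in practice you might count the rows in
# a handful of partitions you know to be representative.
print(estimate_rows(1146, [90, 110, 100]))
```

The result is only as good as the sample, so a table with a few very wide partitions will skew it.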

Another approach would be to figure out the size (in bytes) of one row, then look at the P50 of Partition Size in the output of nodetool tablehistograms keyspacename.tablename:

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             2.00             35.43           4866.32               124                 1

Divide the P50 (50th percentile) of Partition Size by the size of one row. That should give you the typical number of rows per partition for that table. Then multiply that by the "number of partitions" and you should have your estimate for that node.
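The arithmetic can be sketched as below. This is only an illustration: the function name is hypothetical, the 62-byte row size is an assumed value, and the 124-byte P50 and the 1146 partitions come from two different example outputs in this thread, so the result is not a real count for any table.

```python
def rows_from_sizes(partition_estimate, p50_partition_bytes, row_bytes):
    """Estimate rows on one node from the median partition size and a row size."""
    if row_bytes <= 0:
        raise ValueError("row_bytes must be positive")
    # Typical rows per partition = median partition size / size of one row.
    rows_per_partition = p50_partition_bytes / row_bytes
    return round(partition_estimate * rows_per_partition)


# 124  -> P50 "Partition Size" from the tablehistograms sample above
# 62   -> assumed size of one row in bytes (made up for illustration)
# 1146 -> "Number of partitions (estimate)" from the question
print(rows_from_sizes(1146, 124, 62))
```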

How does one get the size of one row in Cassandra?

$ bin/cqlsh 127.0.0.1 -u aaron -p yourPasswordSucks -e "SELECT * FROM system.local WHERE key='local';" > local.txt
$ ls -al local.txt
-rw-r--r--  1 z001mj8  DHC\Domain Users  2321 Sep 16 15:08 local.txt

Obviously, you'll want to take out things like the pipe delimiters and the row header (not to mention accounting for the size difference between strings and numerics), but the final byte size of the file should put you in the ballpark.
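One way to do that cleanup is sketched below. It assumes cqlsh's default tabular layout (a header line, a separator line of dashes and plus signs, data lines, then a "(N rows)" footer) and a single-row result; the function name and the sample text are made up for illustration, and it does not handle expanded output or warnings printed by cqlsh.

```python
import re


def payload_bytes(cqlsh_output):
    """Count the bytes of cell values only, skipping header, separator and footer."""
    lines = []
    for line in cqlsh_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if set(stripped) <= set("-+"):
            continue  # separator line such as -------+-------
        if re.fullmatch(r"\(\d+ rows?\)", stripped):
            continue  # footer such as (1 rows)
        lines.append(stripped)
    data_rows = lines[1:]  # the first remaining line is the column header
    total = 0
    for row in data_rows:
        # Note: a cell value that itself contains "|" is split too, which
        # only drops the pipe characters from the count.
        for cell in row.split("|"):
            total += len(cell.strip().encode("utf-8"))
    return total


# Made-up sample resembling the file written by the cqlsh command above.
sample = """
 key   | bootstrapped
-------+--------------
 local |    COMPLETED

(1 rows)
"""
print(payload_bytes(sample))
```

This counts the text form of each value, so numerics and other non-string types will still differ from their on-disk size.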

Upvotes: 1

Jim Wartnick

Reputation: 2196

The "number of keys" is the same as "the number of partitions", and again it is an estimate. If your partition key is the primary key (no clustering columns), then you'll have an estimate for the number of rows on that node. Otherwise, it's simply that: an estimate of the number of partition key values.

-Jim

Upvotes: 1
