Reputation: 683
How can I count the rows in a Cassandra column family more efficiently using the Python driver? I am currently using the following code:
from cassandra.cluster import Cluster

servers = ['server1', 'server2']
cluster = Cluster(servers)
session = cluster.connect()

result = session.execute('select * from ks1.t1')
count = 0
for i in result:
    count += 1
print(count)
Upvotes: 1
Views: 4402
Reputation: 1290
To achieve this in Python, why not do the following:
from cassandra.cluster import Cluster

servers = ['server1', 'server2']
cluster = Cluster(servers)
session = cluster.connect()

result = session.execute('select count(*) from ks1.t1')
count = 0
for row in result:  # there will only be one row
    count += row.count
print(count)
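For what it's worth, newer versions of the driver let you skip the loop entirely; this is a minimal sketch, assuming cassandra-driver 3.x, where the result set exposes one():

# Fetch the single aggregate row directly; Cassandra names the
# COUNT(*) result column 'count'.
row = session.execute('select count(*) from ks1.t1').one()
print(row.count)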
Upvotes: 1
Reputation: 1381
Brian Hess has a standalone utility, 'cassandra-count':
A simple program to count the number of records in a Cassandra table. By splitting the token range using the numSplits parameter, you can reduce how much each query has to count, and with it the probability of timeouts.
It is true that Spark is well suited to this operation; however, the goal of this program is to be a simple utility that does not require Spark.
https://github.com/brianmhess/cassandra-count
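The same token-range idea can be sketched directly with the Python driver. This is just an illustration, not cassandra-count's actual code; it assumes the default Murmur3Partitioner and uses a hypothetical partition key column named pk:

from cassandra.cluster import Cluster

# Full Murmur3Partitioner token range
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def count_by_token_ranges(session, num_splits=16):
    step = (MAX_TOKEN - MIN_TOKEN) // num_splits
    total = 0
    for i in range(num_splits):
        lo = MIN_TOKEN + i * step
        hi = MAX_TOKEN if i == num_splits - 1 else lo + step
        # Half-open range (lo, hi]; Murmur3 never assigns the
        # sentinel value -2**63 itself, so nothing is missed.
        row = session.execute(
            'select count(*) from ks1.t1 '
            'where token(pk) > %s and token(pk) <= %s',
            (lo, hi)).one()
        total += row.count
    return total

session = Cluster(['server1', 'server2']).connect()
print(count_by_token_ranges(session, num_splits=64))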
Upvotes: 0
Reputation: 8812
That's a terrible way to count rows: you're doing a full table scan. Counting rows exactly in a distributed system is hard.
You can get an estimate of the number of partitions (partition == row if your table has no clustering columns) with nodetool tablestats/cfstats; look for the 'Number of partitions (estimate)' line.
If you absolutely need an exact row count, use a co-located Spark installation to fetch all the data into Spark memory locally and count it there. That way the counting is distributed and does not overwhelm the coordinator.
Sample scala code:
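// Run in spark-shell with the spark-cassandra-connector on the
// classpath; 'sc' is the SparkContext the shell provides.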
import com.datastax.spark.connector._
sc.cassandraTable("keyspace", "table_name").count()
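If you work from pyspark instead of the Scala shell, the DataFrame reader gives the same distributed count; a sketch, assuming the spark-cassandra-connector package is available and a SparkSession named spark:

# Read the table through the Cassandra connector and let Spark
# distribute the count across executors.
df = (spark.read.format('org.apache.spark.sql.cassandra')
      .options(keyspace='keyspace', table='table_name')
      .load())
print(df.count())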
Upvotes: 0