Reputation: 586
I have a very large Cassandra table containing over 1 billion records. Its primary key looks like this: "(partition_id, cluster_id1, cluster_id2)". Now, for several particular partition_id values, there are so many records that I can't run a row count on those partition keys without a timeout exception being raised.
What I ran in cqlsh is:
SELECT count(*) FROM relation WHERE partition_id='some_huge_partition';
I got this exception:
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I tried setting --connect-timeout and --request-timeout, with no luck.
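For reference, I invoked cqlsh roughly like this (the host name and timeout values are just placeholders):
cqlsh my_cassandra_host --connect-timeout=60 --request-timeout=3600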
I counted the same data in Elasticsearch, and the row count for the same partition is approximately 30 million.
My Cassandra is 3.11.2 and cqlsh is 5.0.1. The cluster contains 3 nodes, each with more than 1 TB of HDD storage (fairly old servers, more than 8 years old).
So in short, my question is: how can I count the rows in such a huge partition without hitting the read timeout?
Big thanks in advance.
Upvotes: 1
Views: 2276
Reputation: 586
I've found that with Spark and the awesome Spark Cassandra Connector library, I can finally count a large table without encountering any of the timeout limitations. The Python Spark code looks like this:
# Assumes pyspark was started with the Spark Cassandra Connector on the classpath (e.g. via --packages) and sqlContext is available
tbl_user_activity = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace='ks1', table='user_activity').load()
tbl_user_activity.where('id = 1').count()
It will run for a while but in the end it works.
Upvotes: 0
Reputation: 57748
Yes, working with large partitions is difficult in Cassandra. There really isn't a good way to monitor particular partition sizes, although Cassandra will warn about writing large partitions in your system.log. Unbounded partition growth is something you need to address when designing your table, and it usually involves adding an additional (often time-based) partition key component derived from an understanding of your business use case.
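For example, here is a sketch of what a time-bucketed version of a table like yours could look like (the column types and the month_bucket column are just assumptions, since I don't know your actual schema):
CREATE TABLE relation_by_month (
    partition_id text,
    month_bucket text,   -- e.g. '2018-06', derived from the row's timestamp
    cluster_id1 text,
    cluster_id2 text,
    PRIMARY KEY ((partition_id, month_bucket), cluster_id1, cluster_id2)
);
Reads then have to specify both partition_id and month_bucket, but no single partition grows without bound.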
The answer here is that you may be able to export the data in the partition using the COPY command. To keep it from timing out, you'll want to use the PAGESIZE and PAGETIMEOUT options, kind of like this:
COPY products TO '/home/aploetz/products.txt'
WITH DELIMITER='|' AND HEADER=true
AND PAGETIMEOUT=40 AND PAGESIZE=20;
That will export the products table to a pipe-delimited file, with a header, at a page size of 20 rows and with a 40-second timeout for each page fetch.
If you still get timeouts, try decreasing PAGESIZE and/or increasing PAGETIMEOUT.
Upvotes: 2