Reputation: 3179
I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8. For auditing purposes I need to calculate the row count of a C* table, which has around 2 billion records.
To count the rows I tried both of the approaches shown below.
// requires: import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
public static Long getColumnFamilyCountJavaApi(SparkSession spark, String keyspace, String columnFamilyName) throws IOException {
    JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
    // pushes the count down to Cassandra, one count(*) per token range
    return javaFunctions(sc).cassandraTable(keyspace, columnFamilyName).cassandraCount();
}

public static Long getColumnFamilyCount(SparkSession spark, String keyspace, String columnFamilyName) throws IOException {
    // reads the rows back through the connector and counts them in Spark
    return spark
        .read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", columnFamilyName)
        .option("keyspace", keyspace)
        .load()
        .count();
}
But both approaches result in the same error:
Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded, 2 failed)
    at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:85)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
    at com.datastax.spark.connector.cql.DefaultScanner.scan(Scanner.scala:34)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:342)
How should I handle this scenario?
Upvotes: 0
Views: 614
Reputation: 2158
That stack trace is a read failure on the Cassandra nodes (the replicas failed to respond rather than simply timing out). It could actually be due to a number of reasons, so rather than chase this particular error I'm going to answer in the context of what your end goal is here.
You are trying to count rows in a Cassandra table.
While this isn't an unreasonable request, it's a bit of a tricky topic for Cassandra, because a count has to touch every node in the cluster: the rows are spread across all the token ranges. See this rather good blog article on why this is so.
I can see you're using Spark here, so you're likely already aware that counting in cqlsh can be expensive. You might want to have a look at the academy video here for cassandraCount.
Also see the Spark connector docs for the read tuning options (split size, fetch size, read timeout, input consistency level); a minimal sketch of a tuned count is shown below.
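For the cassandraCount route on a 2-billion-row table, the main lever is reducing how hard each task hits the cluster via the connector's read settings. Here is a minimal sketch, assuming the 2.4.x connector option names shown and placeholder host/keyspace/table values (127.0.0.1, my_keyspace, my_table):

    // Sketch only: host, keyspace and table names are placeholders; adjust values to your cluster.
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    public class CassandraCountExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("cassandra-count")
                .config("spark.cassandra.connection.host", "127.0.0.1")
                // smaller splits -> more, lighter token-range scans per task
                .config("spark.cassandra.input.split.sizeInMB", "64")
                // fewer rows per page -> smaller individual reads on the nodes
                .config("spark.cassandra.input.fetch.sizeInRows", "500")
                // give slow range scans more time before the driver gives up
                .config("spark.cassandra.read.timeoutMS", "240000")
                // a count doesn't need LOCAL_QUORUM; LOCAL_ONE only needs one healthy replica
                .config("spark.cassandra.input.consistency.level", "LOCAL_ONE")
                .getOrCreate();

            JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
            long rows = javaFunctions(sc)
                .cassandraTable("my_keyspace", "my_table")
                .cassandraCount();   // count(*) per token range instead of pulling rows back to Spark

            System.out.println("row count = " + rows);
            spark.stop();
        }
    }

Lowering the split and fetch sizes makes each token-range scan cheaper for the nodes, and dropping the input consistency to LOCAL_ONE means a single healthy replica is enough per read, which usually avoids the LOCAL_QUORUM failures you're seeing.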
You might also be interested in the DSBulk tool. I've used it successfully for a number of things, from large data migrations to small jobs like counts. See the DSBulk docs here.
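With DSBulk a count is typically a one-liner, something like `dsbulk count -k <keyspace> -t <table> -h <host>` (fill in your own keyspace, table and contact point); it runs the count directly against the cluster without needing Spark at all.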
Hope this helps some!
Upvotes: 3