BdEngineer

Reputation: 3179

Cassandra count query throwing ReadFailureException

I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8. For auditing purposes I need to calculate the row count of a C* table, which has around 2 billion records.

To count the rows I tried both of the approaches shown below.

// Requires: import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
public static Long getColumnFamilyCountJavaApi(SparkSession spark, String keyspace, String columnFamilyName) throws IOException {
  JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
  // Pushes the count down to Cassandra, one sub-count per token range.
  return javaFunctions(sc).cassandraTable(keyspace, columnFamilyName).cassandraCount();
}

public static Long getColumnFamilyCount(SparkSession spark, String keyspace, String columnFamilyName) throws IOException {
  // Loads the table via the Data Source API and counts it in Spark.
  return spark
      .read()
      .format("org.apache.spark.sql.cassandra")
      .option("table", columnFamilyName)
      .option("keyspace", keyspace)
      .load()
      .count();
}

But both approaches result in the same error.

   Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded, 2 failed)
            at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:85)
    com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
            at com.datastax.spark.connector.cql.DefaultScanner.scan(Scanner.scala:34)
            at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:342)

How can I handle this scenario?

Upvotes: 0

Views: 614

Answers (1)

markc

Reputation: 2158

That error stack is the replica nodes failing to answer the read, which could actually be due to a number of reasons. Rather than answer this particular error, I'm going to answer in the context of what your end goal is here.

You are trying to count rows in a table in Cassandra.

While this isn't an unreasonable request, it is a bit of a tricky topic for Cassandra, because a count has to span the whole cluster. See this rather good blog article on why this is so.

I can see you're using Spark here, so you're likely already aware that counting in cqlsh can be expensive. You might want to have a look at the academy video here on cassandraCount, and also see the Spark connector docs.
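
As a rough sketch (this is not from your code, and the property names below are the spark-cassandra-connector 2.x ones, so check them against the reference docs for your exact version), you can take some read pressure off the cluster before calling cassandraCount, e.g. smaller input splits, a smaller fetch page, a longer driver read timeout and LOCAL_ONE reads:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class TableCount {
  public static long countWithGentlerReads(String keyspace, String table) {
    SparkSession spark = SparkSession.builder()
        .appName("cassandra-audit-count")
        // Smaller Spark partitions per token range -> less data per read request.
        .config("spark.cassandra.input.split.sizeInMB", "32")
        // Fewer rows fetched per page from Cassandra.
        .config("spark.cassandra.input.fetch.sizeInRows", "500")
        // Give slow replicas more time before the driver gives up.
        .config("spark.cassandra.read.timeoutMs", "300000")
        // Reads only need one local replica to answer.
        .config("spark.cassandra.input.consistency.level", "LOCAL_ONE")
        .getOrCreate();

    JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
    // cassandraCount() asks Cassandra to count each token range and sums the results,
    // instead of pulling whole rows back into Spark.
    return javaFunctions(sc).cassandraTable(keyspace, table).cassandraCount();
  }
}

The same settings can also be supplied as --conf options to spark-submit rather than set in code.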

You might also be interested in the DSBulk tool. I've used it successfully for a number of things, from large data migrations to small jobs like counts. See the DSBulk docs here.
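
For reference, a DSBulk count is a one-liner; a rough example (with placeholder keyspace, table and contact-point values) looks like:

dsbulk count -k my_keyspace -t my_table -h 10.0.0.1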

Hope this helps some!

Upvotes: 3
