Kaushal

Reputation: 3367

How to repartition CassandraRDD in Apache Spark

I am using a three-node Cassandra cluster with six Spark workers, each with 1 core and 2 GB of RAM. In my Spark application, I am trying to fetch the entire contents of a Cassandra table, which has more than 300k rows, and do some aggregation on it.

But fetching the data from Cassandra takes a lot of time. I also went through the Spark UI and saw that the Spark stage has 3 partitions, of which two execute very fast (within seconds) but the third takes a long time (7 min).

I also tried to repartition the CassandraRDD to increase the number of tasks and distribute them across all six workers, but I couldn't find a solution.

Upvotes: 2

Views: 656

Answers (1)

RussS

Reputation: 16576

To adjust the number of tasks created by the CassandraRDD, you need to adjust spark.cassandra.input.split.size. This determines how many Spark partitions will actually be made.

Property                          | Description                                                 | Default
spark.cassandra.input.split.size  | approx. number of Cassandra partitions in a Spark partition | 100000

Note that this controls the number of C* partitions, not C* rows, in a Spark partition. It is also an estimate, so you can't be guaranteed that exactly this number of tokens will end up in a Spark partition.
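As a rough sketch, you would lower the split size when building the SparkConf before reading the table. The host, keyspace, and table names below are placeholders, and the exact split value to pick depends on your data layout:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Lower the split size so the table is divided into more Spark
// partitions than the default (100000 C* partitions per split).
val conf = new SparkConf()
  .setAppName("CassandraRepartitionExample")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder: your C* node
  .set("spark.cassandra.input.split.size", "10000")    // smaller split => more tasks

val sc = new SparkContext(conf)

// "my_keyspace" and "my_table" are placeholder names.
val rdd = sc.cassandraTable("my_keyspace", "my_table")

// With a smaller split size, this should report more than 3 partitions,
// giving all six workers something to do.
println(rdd.partitions.length)
```

This only affects how the read is split; if one C* partition holds a disproportionate share of the rows, the task reading it will still be slow.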

If you continue to see some partitions running slower than others, I would investigate the health of the node serving that partition and check for hotspots.

Upvotes: 1
