Reputation: 542
I am running a Spark application that performs a direct join on a Cassandra table.
I am trying to control the number of reads per second, so that the long-running job doesn't impact the overall database. Here are my configuration parameters:
--conf spark.cassandra.concurrent.reads=2
--conf spark.cassandra.input.readsPerSec=2
--conf spark.executor.cores=1
--conf spark.executor.instances=1
--conf spark.cassandra.input.fetch.sizeInRows=1500
I know I won't read more than 1500 rows from each partition. However, in spite of all these thresholds, the reads per second are reaching 200-300.
Is there any other flag or configuration that needs to be turned on?
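For context, the join is built with joinWithCassandraTable, which is what produces the CassandraJoinRDD doing the point reads. Below is a minimal sketch of the shape of the job; the keyspace, table, and key values are placeholders, not my actual schema:

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

// Placeholder names, for illustration only.
val spark = SparkSession.builder().appName("direct-join-throttle-test").getOrCreate()
val keys = spark.sparkContext.parallelize(1L to 1000000L).map(Tuple1(_))

// joinWithCassandraTable issues a point read per key and returns a CassandraJoinRDD,
// the RDD whose read rate I am trying to limit with the settings above.
val joined = keys.joinWithCassandraTable("my_keyspace", "my_table")
joined.count()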
Upvotes: 0
Views: 667
Reputation: 76
It seems that CassandraJoinRDD has a bug in throttling with spark.cassandra.input.readsPerSec; see https://datastax-oss.atlassian.net/browse/SPARKC-627 for details.
In the meantime, use spark.cassandra.input.throughputMBPerSec to throttle your join. Note that the throttling is based on the RateLimiter class, so it won't kick in immediately: you need to read at least throughputMBPerSec worth of data before throttling starts. This is something that may be improved in the SCC (Spark Cassandra Connector).
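For example, to cap reads at roughly 1 MB/s (the value is arbitrary; tune it for your cluster):

--conf spark.cassandra.input.throughputMBPerSec=1

Note that this limit is applied per executor core and is expressed in megabytes rather than rows, so with your single-core, single-executor setup it is also the total cap; you'll need an estimate of your average row size to translate it into a rows-per-second target.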
Upvotes: 1