Reputation: 542
I am running a Spark application that performs a direct join on a Cassandra table.
I am trying to control the number of reads per second, so that the long-running job doesn't impact the overall database. Here are my configuration parameters:
--conf spark.cassandra.concurrent.reads=2
--conf spark.cassandra.input.readsPerSec=2
--conf spark.executor.cores=1
--conf spark.executor.instances=1
--conf spark.cassandra.input.fetch.sizeInRows=1500
I know I won't read more than 1500 rows from each partition. However, in spite of all these thresholds, the reads per second are reaching 200-300.
Is there any other flag or configuration that needs to be turned on?
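For context, the join is built with joinWithCassandraTable, which is what produces the CassandraJoinRDD doing the point reads. Below is a minimal sketch of the shape of the job; the keyspace, table, and key values are placeholders, not my actual schema:

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

// Placeholder names, for illustration only.
val spark = SparkSession.builder().appName("direct-join-throttle-test").getOrCreate()
val keys = spark.sparkContext.parallelize(1L to 1000000L).map(Tuple1(_))

// joinWithCassandraTable issues a point read per key and returns a CassandraJoinRDD,
// the RDD whose read rate I am trying to limit with the settings above.
val joined = keys.joinWithCassandraTable("my_keyspace", "my_table")
joined.count()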
Upvotes: 0
Views: 667
Reputation: 76
It seems that CassandraJoinRDD has a bug in throttling with spark.cassandra.input.readsPerSec; see https://datastax-oss.atlassian.net/browse/SPARKC-627 for details.
In the meantime, use spark.cassandra.input.throughputMBPerSec to throttle your join. Note that the throttling is based on the RateLimiter class, so it won't kick in immediately: you need to read at least throughputMBPerSec worth of data before throttling starts. This is something that may be improved in the SCC (Spark Cassandra Connector).
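For example, to cap reads at roughly 1 MB/s (the value is arbitrary; tune it for your cluster):

--conf spark.cassandra.input.throughputMBPerSec=1

Note that this limit is applied per executor core and is expressed in megabytes rather than rows, so with your single-core, single-executor setup it is also the total cap; you'll need an estimate of your average row size to translate it into a rows-per-second target.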
Upvotes: 1