Reputation: 599
I use tokens in Cassandra (not vnodes).

I am writing a simple job that reads data from a Cassandra table and displays its count. The table has around 70 million rows, and the count takes 15 minutes.

When I read the data, the number of partitions of the RDD is somewhere around 21,000, which is too large. How do I control this number?

I have tried splitCount and split.size.in.mbs, but they show me the same number of partitions.

Any suggestions?
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object Hi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196")
      .set("spark.cassandra.input.split.size_in_mb", "64")
    val sc = new SparkContext(conf)

    // Full scan of the Cassandra table
    val rdd = sc.cassandraTable("cw", "usedcareventsbydatecookienew")

    // Number of Spark partitions created for the scan
    println("hello world " + rdd.partitions.length)
    // Row count (forces the full scan)
    println("hello world " + rdd.count)
  }
}
This is my code, for reference. After running nodetool compact I am now able to control the number of partitions, but the whole process still takes almost 6 minutes, which I think is too high. Any suggestions for improvements?
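One Spark-side workaround, separate from the connector settings, is to merge the ~21,000 scan partitions into fewer tasks before counting, which cuts task-scheduling overhead. This is only a minimal sketch using Spark's own coalesce, assuming the same keyspace and table names as above; the target of 200 partitions is an arbitrary example value:

// Merge the many small scan partitions into fewer tasks.
// shuffle = false combines partitions without moving data across the network.
val rdd = sc.cassandraTable("cw", "usedcareventsbydatecookienew")
val fewer = rdd.coalesce(200, shuffle = false)
println("partitions after coalesce: " + fewer.partitions.length)
println("count: " + fewer.count)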
Upvotes: 3
Views: 2582
Reputation: 783
Are you looking for spark.cassandra.input.split.size?
spark.cassandra.input.split.size (default: 64): Approximate number of rows in a single Spark partition. The higher the value, the fewer Spark tasks are created. Increasing the value too much may limit the parallelism level.
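For reference, setting this on the SparkConf looks roughly like the sketch below. This assumes a connector version that reads the spark.cassandra.input.split.size key, and the value 1000000 is only an illustrative rows-per-partition figure, not a recommendation:

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "172.16.4.196")
  // Approximate rows per Spark partition; a higher value yields fewer, larger partitions.
  .set("spark.cassandra.input.split.size", "1000000")
val sc = new SparkContext(conf)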
Upvotes: 4
Reputation: 599
My problem was solved once I ran the compact command on my Cassandra table; now I am able to control the number of partitions using the spark.cassandra.input.split.size parameter.
Upvotes: 0