deenbandhu

Reputation: 599

How to control the number of partitions while reading data from Cassandra?

I use:

  1. cassandra 2.1.12 - 3 nodes
  2. spark 1.6 - 3 nodes
  3. spark cassandra connector 1.6

I use tokens in Cassandra (not vnodes).

I am writing a simple job that reads data from a Cassandra table and displays its count. The table has around 70 million rows, and the job takes 15 minutes.

When I read the data and check the number of partitions of the RDD, it is somewhere around 21,000, which is too large. How do I control this number?

I have tried splitCount and split.size.in.mbs, but they show me the same number of partitions.

Any suggestions?

import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._

object Hi {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196")
      .set("spark.cassandra.input.split.size_in_mb", "64")
    val sc = new SparkContext(conf)

    val rdd = sc.cassandraTable("cw", "usedcareventsbydatecookienew")

    // rdd.partitions is an Array[Partition]; print its length, not the array itself
    println("number of partitions: " + rdd.partitions.length)
    println("row count: " + rdd.count)
  }
}

This is my code, for reference. After I ran nodetool compact, I am able to control the number of partitions, but the whole process still takes almost 6 minutes, which I think is too high. Any suggestions for improvement?
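Since the job only needs a row count, one further option is to push the counting down to Cassandra instead of streaming all 70 million rows through Spark. A minimal sketch, assuming connector 1.6's cassandraCount() on the table scan RDD (host and table names taken from the question):

import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._

object CountOnly {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196")
    val sc = new SparkContext(conf)

    // cassandraCount() executes the count on the Cassandra side,
    // so rows are not fetched into Spark just to be counted
    val count = sc.cassandraTable("cw", "usedcareventsbydatecookienew").cassandraCount()
    println("row count: " + count)
  }
}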

Upvotes: 3

Views: 2582

Answers (2)

chaitan64arun

Reputation: 783

Are you looking for spark.cassandra.input.split.size_in_mb?

spark.cassandra.input.split.size_in_mb Default = 64. Approximate amount of data (in MB) fetched into a single Spark partition. The higher the value, the fewer Spark tasks are created. Increasing the value too much may limit the parallelism level.
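For illustration, a minimal sketch of tuning that setting (assuming connector 1.6, where the parameter is spelled spark.cassandra.input.split.size_in_mb; host and table names are taken from the question):

import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "172.16.4.196")
  // a larger split size means fewer, bigger Spark partitions
  .set("spark.cassandra.input.split.size_in_mb", "256")
val sc = new SparkContext(conf)

// getNumPartitions (available since Spark 1.6) reports how many
// partitions the table scan was split into
println(sc.cassandraTable("cw", "usedcareventsbydatecookienew").getNumPartitions)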

Upvotes: 4

deenbandhu

Reputation: 599

My problem was solved when I ran the compact command on my Cassandra table; now I am able to control the partition count using the spark.cassandra.input.split.size_in_mb parameter.

Upvotes: 0
