Shrinivas Suresh

Reputation: 1

DataFrame write 10x slower than RDD save to cassandra in spark

I compared saving a table with 30,000 records to Cassandra using an RDD and a Dataset, and found that the Dataset save was 10 times slower than the RDD save. The table has 4 partitioning keys.

 DSE version: 5.1.7
 Spark version: 2.0.1
 Nodes: 6 (20 cores each, 6 GB)
 Using Spark Standalone

We used the following Spark configurations (a sketch of applying them to a SparkConf follows the list):

  1. spark.scheduler.listenerbus.eventqueue.size=100000
  2. spark.locality.wait=1
  3. spark.dse.continuous_paging_enabled=false
  4. spark.cassandra.input.fetch.size_in_rows=500
  5. spark.cassandra.connection.keep_alive_ms=10000
  6. spark.cassandra.output.concurrent.writes=2000
  7. num-cpu-cores=48
  8. memory-per-node=3g
  9. spark.executor.cores=3
  10. spark.cassandra.output.ignoreNulls=true
  11. spark.cassandra.output.throughput_mb_per_sec=10
  12. spark.serializer=org.apache.spark.serializer.KryoSerializer
  13. spark.cassandra.connection.local_dc=dc1
  14. spark.cassandra.connection.compression=LZ4
  15. spark.cassandra.connection.connections_per_executor_max=20
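
For reference, here is a minimal sketch of how settings like these can be applied to the SparkConf that is passed to the SparkSession builder below. The values simply mirror the list above and the app name is a placeholder:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cassandra-write-test")   // placeholder app name
  // Cassandra connector read/write tuning from the list above
  .set("spark.cassandra.input.fetch.size_in_rows", "500")
  .set("spark.cassandra.output.concurrent.writes", "2000")
  .set("spark.cassandra.output.throughput_mb_per_sec", "10")
  .set("spark.cassandra.output.ignoreNulls", "true")
  .set("spark.cassandra.connection.keep_alive_ms", "10000")
  .set("spark.cassandra.connection.local_dc", "dc1")
  .set("spark.cassandra.connection.compression", "LZ4")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")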

Following is the sample code for both approaches:

import org.apache.spark.sql.{SaveMode, SparkSession}
import com.datastax.spark.connector._

val sparkSession = SparkSession.builder().config(conf).getOrCreate()
val sc = sparkSession.sparkContext

import sparkSession.implicits._

// RDD path: read the rows for one id and write them back with saveToCassandra
val RDD1 = sc.cassandraTable[CaseClassModel]("keySpace1", "TableName")
           .where("id = ?", id)

RDD1.saveToCassandra("keySpace1", "TableName")

// Dataset path: same read through the DataFrame API, then write with SaveMode.Append
val DS1 = sparkSession.read
           .format("org.apache.spark.sql.cassandra")
           .options(Map("table" -> "TableName", "keyspace" -> "keySpace1"))
           .load()
           .where("id = '" + id + "'").as[CaseClassModel]

DS1.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append).option("table", "TableName1")
  .option("keyspace", "KeySpace1")
  .save()

Upvotes: 0

Views: 914

Answers (1)

RussS

Reputation: 16576

Since both the DataFrame and RDD methods use the same underlying save code, it is unlikely that you would see such a drastic difference unless the overhead of converting into DF types were extremely high. In our own tests over billions of rows we see only a few percent difference in speed.

While 30k records is a very small amount, so almost any overhead could become relevant, I think the most likely cause is the lookup in the where clause being interpreted differently in the RDD and DF code. I would check that the filter is actually being pushed down in the DF code (see the explain output for the DF load).
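
A quick way to check (a sketch reusing the same DataFrame load as in the question; filteredDF is just an illustrative name) is to print the plan and confirm the id filter appears as a pushed predicate on the Cassandra scan rather than as a separate Filter step after it:

// Sketch: inspect the plan of the filtered load.
// A pushed-down predicate shows up on the scan node, e.g. "PushedFilters: [EqualTo(id, ...)]";
// if it instead appears as a Filter step above the scan, the whole table is being read first.
val filteredDF = sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "TableName", "keyspace" -> "keySpace1"))
  .load()
  .where("id = '" + id + "'")

filteredDF.explain(true)   // extended = true prints logical and physical plans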

Upvotes: 4
