How does spark copy data between cassandra tables?

Question

Can anyone please explain the internal working of spark when reading data from one table and writing it to another in cassandra.

Here is my use case:

I am ingesting data coming in from an IOT platform into cassandra through a kafka topic. I have a small python script that parses each message from kafka to get the tablename it belongs to, prepares a query and writes it to cassandra using datastax's cassandra-driver for python. With that script I am able to ingest around 300000 records per min into cassandra. However my incoming data rate is 510000 records per minute so kafka consumer lag keeps on increasing.

Python script is already making concurrent calls to cassandra. If I increase the number of python executors, cassandra-driver starts failing because cassandra nodes become unavailable to it. I am assumin there is a limit of cassandra calls per sec that I am hitting there. Here is the error message that I get:

ERROR Operation failed: ('Unable to complete the operation against any hosts', {: ConnectionException('Pool is shutdown',), : ConnectionException('Pool is shutdown',)})"

Recently, I ran a pyspark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. Pyspark job completed in around 5 hours. So it processed over 550000 records per min.

Here is the pyspark code I am using:

df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table=sourcetable, keyspace=sourcekeyspace)\
    .load().cache()

df.createOrReplaceTempView("data")

query = ("select dev_id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, " + field + " as value  from data  " )

vgDF = spark.sql(query)
vgDF.show(50)
vgDF.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table=newtable, keyspace=newkeyspace)\
    .save()

Versions:

Cassandra 3.9.
Spark 2.1.0.
Datastax's spark-cassandra-connector 2.0.1
Scala version 2.11

Cluster:

Spark setup with 3 workers and 1 master node.
3 worker nodes also have a cassandra cluster installed. (each cassandra node with one spark worker node)
Each worker was allowed 10 GB ram and 3 cores.

So I am wondering:

Does spark read all the data from cassandra first and then writes it to the new table or is there some kind of optimization in spark cassandra connector that allows it to move the data around cassandra tables without reading all the records?
If I replace my python script with a spark streaming job in which I parse the packet to get the table name for cassandra, will that help me ingest data more quickly into cassandra?

How does spark copy data between cassandra tables?

Answers (1)

Related Questions