How to update Cassandra table with latest row, where Spark Dataframe is having multiple rows with same primary key?

Question

We have Cassandra table person,

CREATE TABLE test.person (
    name text PRIMARY KEY,
    score bigint
)

and Dataframe is,

val caseClassDF = Seq(Person("Andy1", 32), Person("Mark1", 27), Person("Ron", 27),Person("Andy1", 20),Person("Ron", 270),Person("Ron", 2700),Person("Mark1", 37),Person("Andy1", 200),Person("Andy1", 2000)).toDF()

In Spark We wanted to save dataframe to table , where dataframe is having multiple records for the same primary key.

Q 1: How Cassandra Connector internally handles ordering of the rows?

Q2: We are reading data from kafka and saving to Cassandra, and our batch will always have multiple events like above. We want to save the latest score to Cassandra. Any suggestion how we can achieve this??

Connector version we used is spark-cassandra-connector_2.12:3.2.1

Here are some Observation from our side,

val spark = SparkSession.builder()
    .master("local[1]")
    .appName("CassandraConnector")
    .config("spark.cassandra.connection.host", "")
    .config("spark.cassandra.connection.port", "")
    .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
    .getOrCreate()


val caseClassDF = Seq(Person("Andy1", 32), Person("Mark1", 27), Person("Ron", 27),Person("Andy1", 20),Person("Ron", 270),Person("Ron", 2700),Person("Mark1", 37),Person("Andy1", 200),Person("Andy1", 2000)).toDF()

caseClassDF.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "test")
      .option("table", "person")
      .mode("APPEND")
      .save()

When we have

.master("local[1]")

then in Cassandra table, we always see score 2000 for "Andy1" and 2700 fro "Ron", this is the latest in the Seq

Now when we change to,

.master("local[*]") OR .master("local[2]")

then we see some random score in Cassandra table, either 200 or 32 for "Andy1".

Note : We did each run on fresh table. So it is always insert and update in one batch.

We want to save the latest score to Cassandra. Any suggestion how we can achieve this??

How to update Cassandra table with latest row, where Spark Dataframe is having multiple rows with same primary key?

Answers (1)

Related Questions