Reputation: 62
We have Cassandra table person,
CREATE TABLE test.person (
name text PRIMARY KEY,
score bigint
)
and Dataframe is,
val caseClassDF = Seq(Person("Andy1", 32), Person("Mark1", 27), Person("Ron", 27),Person("Andy1", 20),Person("Ron", 270),Person("Ron", 2700),Person("Mark1", 37),Person("Andy1", 200),Person("Andy1", 2000)).toDF()
In Spark We wanted to save dataframe to table , where dataframe is having multiple records for the same primary key.
Q 1: How Cassandra Connector internally handles ordering of the rows?
Q2: We are reading data from kafka and saving to Cassandra, and our batch will always have multiple events like above. We want to save the latest score to Cassandra. Any suggestion how we can achieve this??
Connector version we used is spark-cassandra-connector_2.12:3.2.1
Here are some Observation from our side,
val spark = SparkSession.builder()
.master("local[1]")
.appName("CassandraConnector")
.config("spark.cassandra.connection.host", "")
.config("spark.cassandra.connection.port", "")
.config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
.getOrCreate()
val caseClassDF = Seq(Person("Andy1", 32), Person("Mark1", 27), Person("Ron", 27),Person("Andy1", 20),Person("Ron", 270),Person("Ron", 2700),Person("Mark1", 37),Person("Andy1", 200),Person("Andy1", 2000)).toDF()
caseClassDF.write
.format("org.apache.spark.sql.cassandra")
.option("keyspace", "test")
.option("table", "person")
.mode("APPEND")
.save()
.master("local[1]")
then in Cassandra table, we always see score 2000 for "Andy1" and 2700 fro "Ron", this is the latest in the Seq
.master("local[*]") OR .master("local[2]")
then we see some random score in Cassandra table, either 200 or 32 for "Andy1".
Note : We did each run on fresh table. So it is always insert and update in one batch.
We want to save the latest score to Cassandra. Any suggestion how we can achieve this??
Upvotes: 2
Views: 104
Reputation: 87184
Data in dataframe is by definition aren't ordered, and write into Cassandra will reflect this (inserts and updates are the same things in Cassandra) - data will be written in the random order and last write will win.
If you want to write only the latest value (with max score?) you will need to perform aggregations over your data, and use update
output mode to write data to Cassandra (to write intermediate results of your streaming aggregations). Something like this:
caseClassDF.groupBy("name").agg(max("score")).write....
Upvotes: 1