drobin

Reputation: 276

Spark To Cassandra: Writing Sparse Rows With No Null Values To Cassandra

Q: How do I write only the columns that have values from a Spark DataFrame into Cassandra, and do this efficiently? (Efficiently as in a minimal amount of Scala code, not creating a bunch of tombstones in Cassandra, having it run quickly, etc.)

I have a Cassandra table with two key columns and 300 potential descriptor values.

create table sample (
    key1   text,
    key2   text,
    "0"    text,
    ............
    "299"  text,
    PRIMARY KEY (key1, key2)
);

I have a Spark dataframe that matches the underlying table, but each row in the dataframe is very sparse: other than the two key values, a particular row may have only 4 or 5 of the "descriptors" (columns "0" -> "299") with a value.

I am currently converting the Spark dataframe to an RDD and using saveToCassandra to write the data.

This works, but "null" is stored in columns when there is no value.

For example:

  val saveRdd = sample.rdd

  saveRdd.map(line => (
    line(0), line(1), line(2),
    line(3), line(4), line(5),
    line(6), line(7), line(8),
    line(9), line(10), line(11),
    line(12), line(13), line(14),
    line(15), line(16), line(17),
    line(18), line(19), line(20))).saveToCassandra..........

Creates this in Cassandra:

XYZ | 10 | 49849 | F | | null | null | null | null | null | null | null | null | null | null | | null | null | null | null | null | null | null | null | null | null | TO11142017_Import | null | null | null | null | null | null | null | null | null | null | 20 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Scott Dick-Peddie | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 7/13/2014 0:00 | null | null | null | null | null | null | null | null | null | null | 0 | null | null | null | null | null | null | null | null | null | null | | null | null | null | null | null | null | null | null | null | null | 8 | null | null | null | null | null | null | null | null | null | null | | null | null | null | null | null | null | null | null | null | null | LOCATIONS | null | null | null | null | null | null | null | null | null | null | LOCATIONS | null | null | null | null | null | null | null | null | null | null

Setting spark.cassandra.output.ignoreNulls on SparkSession does not work:

spark.conf.set("spark.cassandra.output.ignoreNulls", "true")
spark.conf.get("spark.cassandra.output.ignoreNulls")

This does not work either:

spark-shell  --conf spark.cassandra.output.ignoreNulls=true

(I have tried setting this in different ways and it does not seem to work however I set it.)

withColumn and filter do not seem to be appropriate solutions. The "unset" concept might be the right thing, but I am not sure how to use it in this case.
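From the connector documentation, the per-value "unset" support appears to look roughly like this (a sketch with hypothetical names - my_keyspace and just two descriptor columns), but I don't see how to scale it to 300 sparse columns per row:

import com.datastax.spark.connector._
import com.datastax.spark.connector.types.CassandraOption

// hypothetical two-descriptor example: column "0" gets a value, column "1"
// is left unset (no null written, no tombstone created)
val unsetRdd = sc.parallelize(Seq(
  ("XYZ", "10", CassandraOption("F"), CassandraOption.Unset)))
unsetRdd.saveToCassandra("my_keyspace", "sample",
  SomeColumns("key1", "key2", "0", "1"))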

Cassandra 3.11.2

spark-cassandra-connector 2.3.0-s_2.11

Spark 2.2.0.2.6.3.0-235

Thank you!

Upvotes: 2

Views: 830

Answers (1)

Alex Ott

Reputation: 87164

Are you sure that ignoreNulls doesn't work for you? cqlsh displays null whenever there is no value in a given cell, regardless of whether anything was actually written there. You can check whether the nulls are really written into the SSTable using the sstabledump tool - if they are, you'll see the cells with deletion information attached (that's how nulls are stored).

Here is an example of running Spark without ignoreNulls (the default) and with ignoreNulls set to true. Testing was done on DSE 5.1.11, which has an older version of the connector, but one matching Cassandra 3.11.

Let's create a test table like this:

create table test.t3 (id int primary key, t1 text, t2 text, t3 text);

Without ignoreNulls, we need the following code for testing:

import com.datastax.spark.connector._   // provides saveToCassandra on RDDs

case class T3(id: Int, t1: Option[String], t2: Option[String], t3: Option[String])
val rdd = sc.parallelize(Seq(T3(1, None, Some("t2"), None)))
rdd.saveToCassandra("test", "t3")
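
As a side note, ignoreNulls can also be requested for a single write by passing a WriteConf explicitly instead of relying on the Spark configuration - a minimal sketch, assuming connector 2.x:

import com.datastax.spark.connector.writer.WriteConf

// sketch only - overrides the configuration default just for this write
rdd.saveToCassandra("test", "t3", writeConf = WriteConf(ignoreNulls = true))

In the test here, however, I rely only on the configuration setting.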

If we look at the data using cqlsh, we will see the following:

cqlsh:test> SELECT * from test.t3;

 id | t1   | t2 | t3
----+------+----+------
  1 | null | t2 | null

(1 rows)

After doing nodetool flush we can look into the SSTables. This is what we'll see:

>sstabledump mc-1-big-Data.db
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 30,
        "liveness_info" : { "tstamp" : "2018-11-06T07:53:38.418171Z" },
        "cells" : [
          { "name" : "t1", "deletion_info" : { "local_delete_time" : "2018-11-06T07:53:38Z" }
          },
          { "name" : "t2", "value" : "t2" },
          { "name" : "t3", "deletion_info" : { "local_delete_time" : "2018-11-06T07:53:38Z" }
          }
        ]
      }
    ]
  }
]

You can see that the columns t1 & t3, which were null, have a deletion_info field.

Now, let's remove the data with TRUNCATE test.t3 and start the Spark shell again with ignoreNulls set to true:

dse spark --conf spark.cassandra.output.ignoreNulls=true

After executing the same Spark code we'll see the same result in cqlsh:

cqlsh:test> SELECT * from test.t3;

 id | t1   | t2 | t3
----+------+----+------
  1 | null | t2 | null

But after performing a flush, sstabledump shows a completely different picture:

>sstabledump mc-3-big-Data.db
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 27,
        "liveness_info" : { "tstamp" : "2018-11-06T07:56:27.035600Z" },
        "cells" : [
          { "name" : "t2", "value" : "t2" }
        ]
      }
    ]
  }
]

As you can see, we only have data for column t2, and no mention of columns t1 & t3, which were null.
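
To tie this back to your DataFrame: you shouldn't need to go through an RDD of tuples at all. A sketch of writing the sparse DataFrame directly, assuming the shell was started with --conf spark.cassandra.output.ignoreNulls=true and that your keyspace is called my_keyspace (adjust the names to your schema):

// sample is the sparse DataFrame from the question; with ignoreNulls=true
// the null cells should be skipped on write, just as in the RDD test above
sample.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "sample"))
  .mode("append")
  .save()

You can verify the result with nodetool flush and sstabledump, as shown above.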

Upvotes: 2
