Reputation: 121
I'm using Structured Streaming. I need to left join a huge (billions of rows) Cassandra table to determine whether the source data in each micro-batch is new or already exists, based on the id column. If I do something like:
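import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.functions.expr

// Alias both sides so that "src.id" and "some_dataset.id" resolve inside expr()
val src = spark.read.cassandraFormat("src", "ks").load().select("id").alias("src")
val query = some_dataset.alias("some_dataset")
  .join(src, expr("src.id = some_dataset.id"), joinType = "leftOuter")
  .withColumn("flag", expr("case when src.id is null then 0 else 1 end"))
  .writeStream
  .outputMode("update")
  .foreach(...)
  .start()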
Can the Cassandra connector push down the left join and look up rows by the join column values present in the source micro-batch, instead of scanning the whole table? Is there a way to tell whether the pushdown happened?
Thanks
Upvotes: 1
Views: 317
Reputation: 87119
Not in the open source version of the Spark Cassandra Connector. It is supported as DSE Direct Join in DSE Analytics, so if you use DataStax Enterprise, you'll get it. With the OSS connector, an optimized join is available only through the RDD API.
Update, May 2020: optimized joins on DataFrames are supported since SCC 2.5.0, together with other features that were previously commercial-only. See this blog post for details.
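To answer the "how can I tell" part, here is a minimal sketch for SCC 2.5.0+, not a definitive recipe. It assumes that `id` is the complete partition key of `ks.src`, that its type matches the `Long` values used here, and that the connector's operator appears in the physical plan as "Cassandra Direct Join" (check this against your connector version). The host address, the `DirectJoinCheck` object, the `usesDirectJoin` helper, and the in-memory `batch` DataFrame are hypothetical stand-ins for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

object DirectJoinCheck {
  // Heuristic check: looks for the direct-join operator name in a plan string.
  // The exact name is an assumption; verify it against your explain() output.
  def usesDirectJoin(plan: String): Boolean =
    plan.contains("Cassandra Direct Join")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("direct-join-check")
      // Registers the connector's Catalyst extensions (needed for the direct join)
      .config("spark.sql.extensions",
        "com.datastax.spark.connector.CassandraSparkExtensions")
      .config("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the ids of one micro-batch
    val batch = Seq(1L, 2L, 3L).toDF("id")

    val src = spark.read
      .cassandraFormat("src", "ks")
      .option("directJoinSetting", "on") // accepted values: "on", "off", "auto"
      .load()
      .select("id")

    // Cassandra table on the right side of a left outer join
    val joined = batch.join(src, batch("id") === src("id"), "left_outer")

    joined.explain() // prints the physical plan for manual inspection
    println(usesDirectJoin(joined.queryExecution.executedPlan.toString))

    spark.stop()
  }
}
```

If the plan shows a regular join over a full Cassandra scan instead, the optimization was not applied. For the streaming query itself, calling `explain()` on the `StreamingQuery` returned by `start()` prints the plan of the most recent micro-batch, so the same inspection applies there.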
Upvotes: 1