processadd

Reputation: 121

Can a join with a Cassandra table be pushed down?

I'm using Structured Streaming. I need to left join a huge Cassandra table (billions of rows) to find out whether each row in the micro-batch is new or already exists, based on the id column. If I do something like:

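import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.functions.expr

// Alias both sides so that "src.id" and "some_dataset.id" resolve in expr(...)
val src = spark.read.cassandraFormat("src", "ks").load().select("id").as("src")
val query = some_dataset.as("some_dataset")
      .join(src, expr("src.id = some_dataset.id"), joinType = "leftOuter")
      .withColumn("flag", expr("case when src.id is null then 0 else 1 end"))
      .writeStream
      .outputMode("update")
      .foreach(...)
      .start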

Can the left join be pushed down to Cassandra, so that it only looks up the join column values present in the micro-batch instead of scanning the whole table? Is there a way to tell whether the pushdown happened?

Thanks

Upvotes: 1

Views: 317

Answers (1)

Alex Ott

Reputation: 87119

Not in the open source version of the Spark Cassandra Connector. Join pushdown is supported as DSE Direct Join in DSE Analytics, so you get it if you use DataStax Enterprise. With the OSS connector, an optimized join is available through the RDD API only.
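For reference, a minimal sketch of what the RDD API route could look like, using the connector's `leftJoinWithCassandraTable`. It assumes that `ks.src` has a single `bigint` partition key column named `id`; the `SrcKey` case class, the object name and the sample ids are hypothetical, and the connection settings (for example `spark.cassandra.connection.host`) are assumed to be supplied at submit time.

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

// Hypothetical key type: assumes ks.src has one bigint partition key column named id.
case class SrcKey(id: Long)

object RddLeftJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-left-join-sketch").getOrCreate()

    // Stand-in for the ids found in one micro-batch.
    val incoming = spark.sparkContext.parallelize(Seq(SrcKey(1L), SrcKey(2L), SrcKey(3L)))

    // Looks up each key in Cassandra by partition key instead of scanning the table.
    // Each element is (key, Option[CassandraRow]); None means the id is not in ks.src.
    val flagged = incoming
      .leftJoinWithCassandraTable[CassandraRow]("ks", "src")
      .map { case (key, row) => (key.id, if (row.isDefined) 1 else 0) }

    flagged.collect().foreach(println)
    spark.stop()
  }
}
```

In a streaming job this lookup would have to run per micro-batch on the batch's underlying RDD, which is one reason the DataFrame-level join is more convenient where it is available.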

Update, May 2020: optimized joins on DataFrames are supported in the open source connector since SCC 2.5.0, together with other previously commercial features. See this blog post for details.
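As for telling whether the pushdown happened: a sketch, assuming SCC 2.5.0 or later, is to enable the connector's Catalyst extensions and inspect the physical plan. The `spark.sql.extensions` value and the `directJoinSetting` read option come from the connector documentation as far as I know; the object name and the sample ids are hypothetical, and the static `batch` DataFrame only stands in for a micro-batch. The join column must be the table's partition key with a matching type, and whether the planner actually picks the direct join for a given join type and data size should be confirmed from the plan rather than assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

object DirectJoinCheckSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("direct-join-check-sketch")
      // Registers the connector's planner rules, which include the direct join strategy.
      .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
      .getOrCreate()
    import spark.implicits._

    // "on" asks for the direct join regardless of the size heuristic (the default is "auto").
    val src = spark.read
      .cassandraFormat("src", "ks")
      .option("directJoinSetting", "on")
      .load()
      .select("id")

    // Stand-in for one micro-batch; assumes id is a bigint partition key in ks.src.
    val batch = Seq(1L, 2L, 3L).toDF("id")

    val joined = batch.join(src, batch("id") === src("id"), "left_outer")

    // If the join was pushed down, the physical plan should contain a
    // "Cassandra Direct Join" node instead of a full scan of ks.src
    // feeding a sort-merge or broadcast join.
    joined.explain()

    spark.stop()
  }
}
```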

Upvotes: 1
