processadd

Reputation: 121

Possible to look up Cassandra in each micro-batch?

We are using Structured Streaming and trying to dedup the source data. If an id column value is duplicated within 20 days, we need to upsert with the earliest event time. 20 days may contain 10-15 billion rows, so we don't want to use dropDuplicates, as the state might be huge. Instead we are thinking of using a Cassandra table to store the state (say, id and the minimum event time seen so far). Every time a micro-batch is triggered, we would look up that state table with the ids in the micro-batch. The number of distinct ids in 20 days is also at the 10-15 billion level; in other words, the state table in Cassandra would hold 10-15 billion rows. So is it feasible to look up or join with this Cassandra table in each micro-batch?

Upvotes: 0

Views: 143

Answers (1)

Alex Ott

Reputation: 87174

The Spark Cassandra Connector has two functions in the RDD API for this kind of lookup: joinWithCassandraTable and leftJoinWithCassandraTable. They allow you to perform an efficient lookup of data in Cassandra by primary key, like this:

import com.datastax.spark.connector._

val joinWithRDD = someRDD.joinWithCassandraTable("test", "table")

The join-with-Cassandra functionality isn't supported in the DataFrame/Dataset API in the open-source version of the connector, but it is supported in the connector that is part of DSE Analytics (the so-called DSE Direct Join). You can, however, convert your data into an RDD and perform the join via the existing API.
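In Structured Streaming, that conversion would happen inside foreachBatch. Below is a minimal sketch of the idea, not a definitive implementation: the keyspace and table (ks.dedup_state, with primary key id and a column min_event_time), the column names id/event_time, and streamingDF are all assumed for illustration.

import com.datastax.spark.connector._
import org.apache.spark.sql.DataFrame

// Field names must match the source column names (assumed here).
case class Event(id: String, event_time: java.sql.Timestamp)

streamingDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    val spark = batch.sparkSession
    import spark.implicits._

    // Convert the micro-batch to an RDD so the RDD join API can be used.
    val events = batch.select("id", "event_time").as[Event].rdd

    // Fetches only the state rows whose primary key appears in this batch,
    // so the lookup cost scales with the batch size, not with the
    // 10-15 billion rows in the state table.
    val withState = events
      .leftJoinWithCassandraTable("ks", "dedup_state")
      .on(SomeColumns("id"))

    // Keep events that are new or earlier than the stored minimum, and
    // write them back as the new state (writes in Cassandra are upserts).
    withState
      .filter { case (e, state) =>
        state.forall(row => e.event_time.before(row.get[java.util.Date]("min_event_time")))
      }
      .map(_._1)
      .saveToCassandra("ks", "dedup_state",
        SomeColumns("id", "min_event_time" as "event_time"))
  }
  .start()

A production job would also reduce duplicate ids within the micro-batch itself (for example, a reduceByKey keeping the minimum event time) before the lookup, so each id is written at most once per batch.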

Upvotes: 1
