processadd

Reputation: 121

Possible to look up Cassandra in each micro-batch?

We are using Structured Streaming and trying to dedup the source data. If an id column value is duplicated within 20 days, we need to upsert with the earliest event time. 20 days may contain 10-15 billion rows, so we don't want to use dropDuplicates, as the state might be huge. Instead we are thinking of using a Cassandra table to store the state (say, id and the minimum event time seen so far). Every time a micro-batch is triggered, we would look up that state table with the ids in the micro-batch. The number of distinct ids in 20 days is also at the 10-15 billion level; in other words, the state table in Cassandra would hold 10-15 billion rows. So is it feasible to look up or join with this Cassandra table in each micro-batch?

Upvotes: 0

Views: 143

Answers (1)

Alex Ott

Reputation: 87174

The Spark Cassandra Connector has two functions in the RDD API for this kind of lookup: joinWithCassandraTable and leftJoinWithCassandraTable. They allow you to perform an efficient lookup of data in Cassandra by primary key, like this:

import com.datastax.spark.connector._

val joinWithRDD = someRDD.joinWithCassandraTable("test", "table")

The join-with-Cassandra functionality isn't supported in the DataFrame/Dataset API in the open-source version of the connector, but it is supported in the connector that is part of DSE Analytics (the so-called DSE Direct Join). You can, however, convert your data into an RDD and perform the join via the existing API.
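In Structured Streaming, that conversion would happen inside foreachBatch. Below is a minimal sketch of the idea, not a definitive implementation: the keyspace and table (ks.dedup_state, with primary key id and a column min_event_time), the column names id/event_time, and streamingDF are all assumed for illustration.

import com.datastax.spark.connector._
import org.apache.spark.sql.DataFrame

// Field names must match the source column names (assumed here).
case class Event(id: String, event_time: java.sql.Timestamp)

streamingDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    val spark = batch.sparkSession
    import spark.implicits._

    // Convert the micro-batch to an RDD so the RDD join API can be used.
    val events = batch.select("id", "event_time").as[Event].rdd

    // Fetches only the state rows whose primary key appears in this batch,
    // so the lookup cost scales with the batch size, not with the
    // 10-15 billion rows in the state table.
    val withState = events
      .leftJoinWithCassandraTable("ks", "dedup_state")
      .on(SomeColumns("id"))

    // Keep events that are new or earlier than the stored minimum, and
    // write them back as the new state (writes in Cassandra are upserts).
    withState
      .filter { case (e, state) =>
        state.forall(row => e.event_time.before(row.get[java.util.Date]("min_event_time")))
      }
      .map(_._1)
      .saveToCassandra("ks", "dedup_state",
        SomeColumns("id", "min_event_time" as "event_time"))
  }
  .start()

A production job would also reduce duplicate ids within the micro-batch itself (for example, a reduceByKey keeping the minimum event time) before the lookup, so each id is written at most once per batch.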

Upvotes: 1
