Benedetto

Reputation: 749

Spark Structured Streaming stream-stream join question

My scenario is slightly different from the classic stream-stream join:

streamA: stream of transactions: transTS, userid, productid, ...

streamB: stream of newly created products: productid, productname, createTS, ...

I want to join each transaction with its product on productid, but I can't find a combination of watermarks and join conditions that makes that happen.

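from pyspark.sql.functions import expr

# Alias both sides so the shared productid column can be referenced unambiguously
streamA_wm = streamA.withWatermark("transTS", "3 minutes").alias("a")
streamB_wm = streamB.withWatermark("createTS", "1 day").alias("b")

joined = streamA_wm.join(
    streamB_wm,
    expr("a.productid = b.productid AND a.transTS >= b.createTS"),
    "leftOuter",
)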

The result is empty.

What am I doing wrong?

Upvotes: 1

Views: 265

Answers (1)

Ged

Reputation: 18128

I think you may have the wrong approach here. Whilst products are transactional when they are created and updated, they are metadata relative to the Transactions stream.

I would suggest the following:

  1. Join the Transactions stream to Products as static reference data, which is not itself subject to stream processing.
  2. Do not cache the Products; this ensures you go back to the source.
  3. Store the Products in Parquet or Kudu.
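
A minimal PySpark sketch of the suggestions above, assuming the Products are stored as Parquet. The function names, `products_path`, and the console sink are placeholders of mine and do not come from the question:

```python
def enrich_transactions(transactions, products):
    """Left-join a streaming transactions DataFrame to a static products DataFrame.

    Every transaction is kept; product columns are null when no product matches.
    """
    return transactions.join(products, on="productid", how="left")


def start_enrichment(spark, transactions, products_path):
    """Hypothetical wiring: read Products as a batch source and start the query.

    `spark` is an existing SparkSession, `transactions` is the streaming
    DataFrame (streamA in the question), `products_path` is an assumed
    Parquet location.
    """
    # spark.read (not spark.readStream) gives a static DataFrame; it is not cached here
    products = spark.read.parquet(products_path)
    enriched = enrich_transactions(transactions, products)
    # Console sink is only for inspection while testing
    return enriched.writeStream.format("console").outputMode("append").start()
```

Because `products` comes from `spark.read` rather than `spark.readStream`, this is a stream-static join, so no watermark is needed on the Products side. Whether product files written after the query starts are visible to later micro-batches depends on the storage format, so check that for your source.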

That said, maybe there is a reason to stream the Products. But then what happens if a Product receives no further updates and you later get Transactions for that Product again?

Upvotes: 1
