Wirelessr

Reputation: 65

How does stream processing deal with volatile streams?

I have read several articles and books about stream processing, and they all assume that all past data is available in the stream when doing event sourcing.

Therefore, we can serve real-time queries by building tables, as KSQL does.
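For illustration, here is a minimal Kafka Streams sketch of the kind of continuously updated table a KSQL `CREATE TABLE ... AS SELECT` produces. The `orders` topic and the count-per-key aggregation are made-up examples, not from the question:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class OrderCountTable {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-table");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Materialize the stream into a table: a running count of events per key.
        // Roughly what KSQL does for:
        //   CREATE TABLE order_counts AS SELECT key, COUNT(*) FROM orders GROUP BY key;
        KTable<String, Long> orderCounts = builder
                .stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .count();

        // Publish table updates so other services can query or consume them.
        orderCounts.toStream().to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Note that such a table is only as complete as the events the application has consumed from the topic, which is exactly why the retention problem below matters.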

But in reality, the data in a stream is not always there from the past to the present. Kinesis, for example, retains data for 24 hours by default, and for at most seven days.

There are two scenarios that bother me.

  1. When a microservice system has been running for a while and then decides to adopt a streaming architecture such as Flink or Kafka Streams, how do we backfill the streaming data that was not there before?
  2. When streaming data is only kept for a limited period, say one day, is there still a way to get the state as of one day ago by building a table?

Upvotes: 0

Views: 165

Answers (1)

blr

Reputation: 968

There are two general techniques that I know of for EDAs (event-driven architectures):

1/ Take regular snapshots from a reliable set of sources (maybe even the publishers, or a dedicated consumer whose only job is taking snapshots). As a consumer, you then merge the latest snapshot into your internal state and apply the live stream on top of it to come into sync (see the sketch below). This option works as long as you're okay with being out of sync for at most one day, or whatever the typical snapshot interval is.
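A minimal sketch of option 1, assuming a hypothetical `loadLatestSnapshot()` that returns the materialized state plus the time it was captured (from S3, a database, etc.). The consumer merges the snapshot, then resumes the stream from that point in time:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SnapshotBootstrap {

    // Hypothetical snapshot: the materialized state plus when it was captured.
    record Snapshot(Map<String, Long> state, long takenAtMillis) {}

    // Placeholder: in practice this would read from S3, a database, etc.
    static Snapshot loadLatestSnapshot() {
        return new Snapshot(new HashMap<>(), System.currentTimeMillis() - 86_400_000L);
    }

    public static void main(String[] args) {
        Snapshot snapshot = loadLatestSnapshot();
        Map<String, Long> state = new HashMap<>(snapshot.state());

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("orders", 0);
            consumer.assign(List.of(tp));

            // Seek to the first offset at or after the snapshot's capture time,
            // so only events the snapshot does not already cover are replayed.
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Map.of(tp, snapshot.takenAtMillis()));
            OffsetAndTimestamp ot = offsets.get(tp);
            if (ot != null) {
                consumer.seek(tp, ot.offset());
            } else {
                consumer.seekToEnd(List.of(tp)); // nothing newer than the snapshot
            }

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    // Apply each live event on top of the snapshot state.
                    state.merge(rec.key(), 1L, Long::sum);
                }
            }
        }
    }
}
```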

2/ The second option is keeping an archive of the event store. You replay past messages from the event store to the relevant consumers.
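A sketch of option 2, assuming the archive is a directory of newline-delimited event files that were offloaded from the stream before expiry (e.g., Kinesis to S3 via Firehose, here simplified to a local path). The file layout and the `apply()` parsing are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class ArchiveReplay {
    public static void main(String[] args) throws IOException {
        Map<String, Long> state = new HashMap<>();

        // 1. Replay archived events in order (oldest file first).
        //    "/var/events/archive" is a made-up location for this sketch.
        try (Stream<Path> files = Files.list(Path.of("/var/events/archive")).sorted()) {
            for (Path file : (Iterable<Path>) files::iterator) {
                for (String event : Files.readAllLines(file)) {
                    apply(state, event);
                }
            }
        }

        // 2. From here, hand off to a live stream consumer (as in the snapshot
        //    sketch), starting at the position recorded with the last archive batch.
    }

    // Illustrative: parse the event and fold it into local state.
    static void apply(Map<String, Long> state, String event) {
        String key = event.split(",")[0];
        state.merge(key, 1L, Long::sum);
    }
}
```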

In either case, you'll need to solve for deduping / merging / filtering data based on your unique business case.
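For the deduping part, the usual building block is an idempotent apply keyed on an event ID. A minimal sketch (the in-memory set is a simplification; production systems often persist the seen IDs with a TTL):

```java
import java.util.HashSet;
import java.util.Set;

public class Deduper {
    // IDs of events that have already been applied. In-memory here for brevity;
    // a real system would persist this (e.g., RocksDB, Redis) and expire old IDs.
    private final Set<String> seen = new HashSet<>();

    /** Returns true exactly once per event ID, so replayed events are applied once. */
    public boolean shouldApply(String eventId) {
        return seen.add(eventId); // Set.add is false if the ID was already present
    }
}
```

Guarding every state update with such a check makes the snapshot merge and the archive replay safe to overlap.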

Upvotes: 1
