DougieHauser

Reputation: 470

Handling "state refresh" in Flink ConnectedStream

We're building an application which has two streams:

  1. A high-volume messages stream
  2. A large, mostly static stream (originating from some Parquet files we have lying around) that we feed into Flink just to get that dataset into saved state

We want to connect the two streams in order to share state, so that the first stream can use the second stream's state for enrichment.

Every day or so, the Parquet files (the second stream's source) are updated, which will require us to clear the state of the second stream and rebuild it (this will probably take about 2 minutes).

The question is, can we block/delay messages from the 1st stream while this process is running?

Thanks.

Upvotes: 0

Views: 299

Answers (2)

ariskk

Reputation: 166

Sounds a bit like your case is similar to FLIP-23, which explores model serving in Apache Flink.

I think it all boils down to how (and if) your static stream is keyed:

  • If it is keyed similarly to your fast data, then you can key both streams, connect them, and have access to the keyed context.
  • If the static stream's events are not keyed in a similar fashion, maybe you should consider emitting control events that trigger a refresh of those static files from an external source (e.g. S3). That's easier said than done, as there is no trivial way to guarantee that all parallel instances of your fast stream will get the control event. You can use ListState as a buffer; how you access it, though, depends on the shape of your data.
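The logic behind the first bullet can be sketched in plain Java with no Flink dependency. The class below is purely illustrative: the maps stand in for Flink's per-key `ValueState` and `ListState`, and the two methods mirror what a `KeyedCoProcessFunction`'s `processElement1`/`processElement2` would do — fast events that arrive before their enrichment record get buffered and are flushed once it shows up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-key logic of a KeyedCoProcessFunction.
class KeyedJoin {
    private final Map<String, String> enrichment = new HashMap<>();    // ~ keyed ValueState
    private final Map<String, List<String>> pending = new HashMap<>(); // ~ keyed ListState

    // Analogue of processElement1: an event from the fast stream.
    List<String> onFastEvent(String key, String payload) {
        String ref = enrichment.get(key);
        if (ref == null) { // enrichment record not seen yet: buffer the event
            pending.computeIfAbsent(key, k -> new ArrayList<>()).add(payload);
            return List.of();
        }
        return List.of(payload + "+" + ref);
    }

    // Analogue of processElement2: a record from the static stream.
    List<String> onReferenceRecord(String key, String ref) {
        enrichment.put(key, ref);
        List<String> out = new ArrayList<>();
        for (String payload : pending.getOrDefault(key, List.of())) {
            out.add(payload + "+" + ref); // flush everything buffered for this key
        }
        pending.remove(key);
        return out;
    }
}
```

In a real job you would key both streams on the join key, `connect` them, and put this logic in the co-process function, letting Flink manage the state.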

It might help if you shared a bit more info about the shape of your data (e.g. are you joining on a key? Are you simply serving a model? Something else?).

Upvotes: 1

kkrugler

Reputation: 9245

There's currently no direct/easy way to block one stream on another stream, unfortunately. The typical solution is to buffer the ingest stream while you load (or re-load) the enrichment stream.
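That buffering pattern can be sketched in plain Java as well. The class and method names below are illustrative, not Flink API; in a real job the buffer would be `ListState` inside the operator that holds the enrichment state, and the begin/finish signals would come from your control stream or external trigger.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: park ingest events while enrichment data reloads.
class BufferingEnricher {
    private Map<String, String> reference = new HashMap<>();
    private final List<String> buffer = new ArrayList<>(); // stands in for ListState
    private boolean reloading = false;

    // Ingest event: enrich immediately, or buffer while a reload is in flight.
    List<String> onEvent(String key) {
        if (reloading) {
            buffer.add(key);
            return List.of();
        }
        return List.of(key + "->" + reference.getOrDefault(key, "?"));
    }

    // Signal that the Parquet re-load (the ~2 minute rebuild) has started.
    void beginReload() { reloading = true; }

    // New data is in: swap it, then replay everything buffered meanwhile.
    List<String> finishReload(Map<String, String> fresh) {
        reference = fresh;
        reloading = false;
        List<String> flushed = new ArrayList<>();
        for (String key : buffer) {
            flushed.addAll(onEvent(key));
        }
        buffer.clear();
        return flushed;
    }
}
```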

One approach you could try is to wrap your ingest stream in a custom SourceFunction that knows when not to generate data, based on some external trigger (the same signal you'd use to know that there is Parquet data to re-load).
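The gating logic such a source would run could look roughly like this. It's a plain-Java sketch: `GatedSource` and its methods are hypothetical, not Flink's SourceFunction API, and a real `run()` loop would sleep or await while closed rather than return.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical gate a custom source's emit loop could consult.
class GatedSource {
    private final AtomicBoolean open = new AtomicBoolean(true);
    private final Iterator<String> upstream;

    GatedSource(Iterator<String> upstream) { this.upstream = upstream; }

    void pause()  { open.set(false); } // external trigger: Parquet reload started
    void resume() { open.set(true); }  // reload finished

    // Forward records only while the gate is open (bounded, non-blocking
    // for brevity; a real source would block instead of returning early).
    List<String> poll(int max) {
        List<String> out = new ArrayList<>();
        while (open.get() && out.size() < max && upstream.hasNext()) {
            out.add(upstream.next());
        }
        return out;
    }
}
```

The same external signal that kicks off the Parquet re-load would call `pause()`, and `resume()` would fire once the new state is built.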

Upvotes: 1
