bfabry
bfabry

Reputation: 1904

Initial state for a dataflow job

I'm trying to figure out how we "seed" the window state for some of our streaming dataflow jobs. Scenario is we have a stream of forum messages, we want to emit a running count of messages for each topic for all time, so we have a streaming dataflow job with a global window and triggers to emit each time a record for a topic comes in. All good so far. But prior to the stream source, we have a large file which we'd like to process to get our historical counts, also, because topics live forever, we need the historical count to inform the outputs from the stream source, so we kind've need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.

Current ideas:

EDIT: Latest option, and what we're going with, is to write the calculation job such that it doesn't matter at all what order the events arrive in, so we'll just push the archive to the pub/sub topic and it will all work. That works in this case, but obviously it affects the downstream consumer (need to either support updates or retractions) so I'd be interested to know what other solutions people have for seeding their window states.

Upvotes: 6

Views: 278

Answers (1)

Tudor Marian
Tudor Marian

Reputation: 239

You can do what you suggested in bullet point 2 --- run two pipelines (in the same main), with the first that populates a pubsub topic from the large file. This is similar to what the StreamingWordExtract example does.

Upvotes: 2

Related Questions