austin_ce

Reputation: 1152

Does every record in a Flink EventTime application need a timestamp?

I'm building a Flink Streaming system that can handle both live data and historical data. All data comes from the same source and is then split into historical and live streams. The live data is timestamped and watermarked, while the historical data is received in order. After the live stream is windowed, both streams are unioned and flow into the same processing pipeline.

I cannot find anywhere whether all records in an EventTime streaming environment need to be timestamped, or whether Flink can even handle this mix of live and historical data at the same time. Is this a feasible approach, or will it create problems that I am too inexperienced to see? What will the impact be on the ordering of the data?

We have this setup to allow us to do partial backfills: each stream is keyed by an id, and we send in historical data to replace the observed data for one id without affecting the live processing of other ids.

This is the job graph: [job graph image]
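For concreteness, here is a minimal sketch of the topology described above -- not the actual job. The Event POJO, its field names, the out-of-orderness bound, and the window size are all placeholders:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PipelineSketch {

    // Placeholder event type; the real schema differs.
    public static class Event {
        public String id;
        public long timestamp;
        public boolean historic;
    }

    public static DataStream<Event> build(DataStream<Event> source) {
        // Split the single source into live and historical streams.
        DataStream<Event> live = source.filter(e -> !e.historic);
        DataStream<Event> historical = source.filter(e -> e.historic);

        // Only the live stream gets timestamps and watermarks.
        DataStream<Event> liveWithTime = live.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        .withTimestampAssigner((e, prev) -> e.timestamp));

        // Window the live stream, then union it with the in-order historical stream.
        DataStream<Event> windowedLive = liveWithTime
                .keyBy(e -> e.id)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .reduce((a, b) -> b); // placeholder aggregation

        return windowedLive.union(historical);
    }
}
```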

Upvotes: 0

Views: 440

Answers (1)

David Anderson

Reputation: 43499

Generally speaking, the best approach is to have proper event-time timestamps on every event, and to use event-time everywhere. This has the advantage of being able to use the exact same code for both live data and historic data -- which is very valuable when the need arises to re-process historic data in order to fix bugs or upgrade your pipeline. With this in mind, it's typically possible to do backfill by simply running a second copy of the application -- one that's processing historic data rather than live data.

As for using a mix of historic and live data in the same application, and whether you need to have timestamps and watermarks for the historic events -- it depends on the details. For example, if you are going to connect the two streams, the watermarks (or lack of watermarks) on the historic stream will hold back the watermarks on the connected stream. This will matter if you try to use event-time timers (or windows, which depend on timers) on the connected stream.
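To illustrate: with two connected keyed streams, an event-time timer fires only once the combined watermark -- the minimum across both inputs -- passes the timer's timestamp. A minimal sketch of that failure mode, with a placeholder Event type:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class ConnectedTimerSketch {

    // Placeholder event type.
    public static class Event {
        public String id;
        public long timestamp;
    }

    public static DataStream<String> connectStreams(
            DataStream<Event> live, DataStream<Event> historical) {
        return live.keyBy(e -> e.id)
                .connect(historical.keyBy(e -> e.id))
                .process(new KeyedCoProcessFunction<String, Event, Event, String>() {
                    @Override
                    public void processElement1(Event e, Context ctx, Collector<String> out) {
                        // This timer fires only when the combined watermark
                        // (the min across both inputs) passes e.timestamp + 60s.
                        // If the historical side never emits watermarks, it never fires.
                        ctx.timerService().registerEventTimeTimer(e.timestamp + 60_000);
                    }

                    @Override
                    public void processElement2(Event e, Context ctx, Collector<String> out) {
                        // Historical events arrive here.
                    }

                    @Override
                    public void onTimer(long ts, OnTimerContext ctx, Collector<String> out) {
                        out.collect("timer fired at " + ts);
                    }
                });
    }
}
```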

I don't think you're going to run into problems, but if you do, a couple of ideas:

  1. You could go ahead and assign timestamps on the historic stream, and write a custom periodic watermark generator that always returns Watermark.MAX_WATERMARK. Since the combined watermark is the minimum across the inputs, this effectively removes any influence the historic stream's watermarks would otherwise have when it's connected to the live stream (see the sketch after this list).
  2. Or you could decouple the backfill operations, and do that in another application (by putting some sort of queuing in-between the two jobs, like Kafka or Kinesis).
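A rough sketch of idea 1, using the WatermarkStrategy API introduced in Flink 1.11 (the Event type and its timestamp field are placeholders):

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class MaxWatermarkSketch {

    // A periodic generator that always emits MAX_WATERMARK, so the historic
    // stream never holds back the combined watermark after a union or connect.
    public static class MaxWatermarkGenerator<T> implements WatermarkGenerator<T> {
        @Override
        public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
            // Nothing to do per event; the periodic emit handles it.
        }

        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
            output.emitWatermark(Watermark.MAX_WATERMARK);
        }
    }

    // Placeholder event type.
    public static class Event {
        public long timestamp;
    }

    public static DataStream<Event> withMaxWatermarks(DataStream<Event> historical) {
        return historical.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Event>forGenerator(ctx -> new MaxWatermarkGenerator<>())
                        .withTimestampAssigner((e, prev) -> e.timestamp));
    }
}
```

Because Flink always takes the minimum watermark across inputs, the live stream's watermarks alone will then drive event time on the connected stream.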

Upvotes: 2
