Reputation: 7940
In the Beam documentation, it is recommended to use withLogAppendTime over withProcessingTime. Why would this be the case?
Upvotes: 2
Views: 200
Reputation: 43499
A couple of reasons to prefer event time processing:
You can redo the processing later -- for example, to fix a bug, make changes, or test another approach. Being able to use the exact same code on both live and historic streams makes things easier.
Consistent, deterministic behavior -- if you run the same data through the same code, you will get the same results. This is not the case with processing time. Again, this makes some things (like testing) easier.
Upvotes: 0
Reputation: 2539
As cricket_007 said, it depends on your use case.
One of the key concepts in Beam is event time processing. That is, you can define your data processing logic not in terms of when the service (the Beam pipeline) receives the data, but in terms of when the event actually occurred (e.g. when a user actually clicked the ad). This helps in streaming cases, where your data streams can contain late or out-of-order events. Beam allows you to handle these cases.
E.g. if your pipeline has a step that does something like "aggregate events that occurred between 1pm and 2pm on Oct 23 2018", what happens if an event that actually happened at 1:30pm arrives late (say at 3:30pm) due to network delays or some other issue? In a processing-time-based approach, this late event would probably be accounted for in the next window (e.g. "2pm to 3pm"). But there's a good chance that your business logic would prefer to recompute the original "1pm to 2pm" aggregation instead of counting the late event in another one. Handling business cases like this is the main reason for event time processing.
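To make that concrete, here is a minimal sketch in the Beam Java SDK of such an hourly aggregation; the ad-click input collection and the two-hour lateness bound are illustrative assumptions, not something from the question:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class HourlyClickCounts {
  // clicks: elements keyed by ad id, with event timestamps already attached by the source.
  static PCollection<KV<String, Long>> hourlyCounts(PCollection<KV<String, Long>> clicks) {
    return clicks
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardHours(1)))
            // Emit a result when the watermark passes the end of the window...
            .triggering(AfterWatermark.pastEndOfWindow()
                // ...and emit an updated result for each pane of late data that still arrives.
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
            // Keep accepting events that arrive up to two hours after the window closes.
            .withAllowedLateness(Duration.standardHours(2))
            // Re-emit the full aggregation, so the original "1pm to 2pm" total is recomputed.
            .accumulatingFiredPanes())
        .apply(Sum.longsPerKey());
  }
}
```

With this setup, a 1:30pm click that only shows up at 3:30pm still lands in the 1pm-2pm window and causes that window's sum to be re-emitted, instead of being counted in the 2pm-3pm window.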
However, you might not need to handle any of that in your business logic: e.g. if you don't do any windowing/aggregations (e.g. basic ETL), if you don't have late data at all (e.g. when you're reading from an existing file), if your business logic simply doesn't care, if events are rare and delivery is reliable enough, or if you don't have a reliable timestamp available in the event data. In those cases you might choose to use processing time instead. It all depends on how your business logic needs the data to be processed.
Event timestamps are assigned close to the event source in Beam (usually in the IO), so in the case of Kafka you have these options for choosing where the event timestamp comes from: https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/kafka/TimestampPolicy.html . Other sources can use other ways to assign timestamps to events (e.g. PubsubIO can read a timestamp specified in the message attributes).
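For illustration, this is roughly what that choice looks like on a KafkaIO read in the Java SDK; the broker address, topic, and deserializers are placeholders for this sketch:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaTimestampPolicies {
  static KafkaIO.Read<Long, String> kafkaRead() {
    return KafkaIO.<Long, String>read()
        .withBootstrapServers("broker:9092")   // placeholder broker
        .withTopic("clicks")                   // placeholder topic
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // Event time = the timestamp the broker stamped on the record when it was
        // appended to the log; re-reading the topic yields the same timestamps.
        .withLogAppendTime();
        // Alternatives:
        //   .withProcessingTime() -- event time = whenever the pipeline reads the record
        //   .withCreateTime(...)  -- event time = the producer-set timestamp in the record
  }
}
```

withLogAppendTime() ties the event timestamp to when the broker appended the record, so re-reading the same topic reproduces the same windowing, which withProcessingTime() cannot guarantee.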
I recommend looking through the presentations here; they go deeper into this topic: https://beam.apache.org/documentation/resources/
Upvotes: 2