Reputation: 1855
According to the Apache Beam documentation:
“event time”, determined by the timestamp on the data element itself
“processing time”, determined by the clock on the system processing the element
My data is a JSON file, and none of my fields is a timestamp. What's my event time in this case?
I'm ingesting data via Pub/Sub and processing it with Cloud Dataflow.
Upvotes: 1
Views: 4375
Reputation: 1
Event time means the data is processed according to the timestamp assigned at the source of each record; it is essentially the time at which the event was created.
Processing time is the time at which the data arrives at the streaming application, and it is also the time at which the data is processed by the streaming service.
From a streaming perspective, you can think of it as the data's creation time versus the data's arrival time.
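As a minimal sketch of the difference in the Beam Python SDK (the `created_at` field and its values are made up for illustration): the timestamp you attach to an element is its event time, while the worker's wall clock at the moment the element is handled is its processing time.

```python
import time
import apache_beam as beam

# Hypothetical records: "created_at" stands in for whatever field
# carries the creation time of the event.
events = [
    {"user": "a", "created_at": 1700000000},
    {"user": "b", "created_at": 1700000060},
]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # Attach the event time: Beam uses this timestamp, not the
        # worker clock, for windowing and watermarks.
        | beam.Map(lambda e: beam.window.TimestampedValue(e, e["created_at"]))
        # The wall clock read here is the processing time.
        | beam.Map(lambda e: print(e, "processed at", time.time()))
    )
```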
Upvotes: 0
Reputation: 75735
Understanding these two notions is paramount when you use Beam windows. The difference between the event time (when the event published to the Pub/Sub topic was generated) and the moment Dataflow actually processes it in streaming mode is the lag.
Dataflow observes this lag and exposes it as a Stackdriver metric you can chart. It is computed by Dataflow and is called the watermark; you can think of it as a kind of average lag.
When you define windows, you can set up triggers based on this watermark and on data that arrives later. The windows themselves can also be closed according to the watermark. It's not really intuitive at first, but it's really helpful and powerful!
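As a hedged illustration of the idea (the keys, timestamps, window size, and lateness values below are invented), a Beam Python pipeline can window on event time and fire on the watermark, with an extra firing for late data:

```python
import apache_beam as beam
from apache_beam.transforms import trigger

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("user", 1), ("user", 2), ("user", 3)])
        # Stamp each element with a made-up event time.
        | beam.Map(lambda kv: beam.window.TimestampedValue(kv, 1700000000 + 10 * kv[1]))
        | beam.WindowInto(
            beam.window.FixedWindows(60),                  # 1-minute event-time windows
            trigger=trigger.AfterWatermark(                # fire when the watermark passes the window end...
                late=trigger.AfterProcessingTime(30)),     # ...and again 30 s after a late element arrives
            allowed_lateness=600,                          # keep window state 10 minutes for late data
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```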
You can find more details in the Beam Programming Guide
Upvotes: 4
Reputation: 88
Event time is the time at which the event actually occurred. It has to be derived from a field in the event, for example a timestamp field.
Processing time is the time at which the event is processed.
In your case, since there is no timestamp field, you can't extract the event time from the data itself.
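If the JSON did carry such a field, a sketch like the following is the usual way to derive event time in the Beam Python SDK (the field name `event_ts` is hypothetical); without such a field there is nothing to derive from:

```python
import json
import apache_beam as beam

class AttachEventTime(beam.DoFn):
    """Parses a JSON payload and re-emits it stamped with its own event time."""
    def process(self, payload):
        record = json.loads(payload)
        # "event_ts" is a hypothetical field holding a Unix timestamp.
        yield beam.window.TimestampedValue(record, record["event_ts"])

# Usage: messages | beam.ParDo(AttachEventTime())
```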
Upvotes: 3
Reputation: 432
In this case the "event time" is when the event is published to the topic. So, for example, if your Dataflow job can't process the published events at the rate they are published, the event time will lag further and further behind, and the system latency of your Dataflow pipeline will increase.
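A minimal sketch of that setup (the project and topic names are placeholders): reading from Pub/Sub without a timestamp attribute means each element's event time is the Pub/Sub publish time.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        # With no timestamp_attribute, the publish time of each message
        # becomes its event time. If a message attribute carried a
        # timestamp, timestamp_attribute="<attr>" would use that instead.
        | beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
        | beam.Map(print)
    )
```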
Upvotes: 2