sacherus
sacherus

Reputation: 2129

Why my data in apache-beam is emitted after a few minutes instead of 10h?

I wanted to have global window which lasts for 10 hours after seeing the first element, but what is happening data is being emitted after a few minutes (or seconds). Why?

Code:

grouped_tis = tracking_informations | beam.WindowInto(window.GlobalWindows(),
                                                        trigger=AfterProcessingTime(10 * 3600),
                                                        accumulation_mode=AccumulationMode.DISCARDING) | beam.GroupByKey() | beam.ParDo(MergeTI())

In dataflow After 30 minutes I already get a lot of dropped elements: droppedDueToClosedWindow 39,147 GroupByKey

Upvotes: 1

Views: 226

Answers (1)

Mikhail Gryzykhin
Mikhail Gryzykhin

Reputation: 106

This looks like a bug in SDK. I've created a jira ticket for Apache Beam Python SDK devs to look into the issue.

It seems that AfterProcessingTime fires early and it causes the window to close. All events that come after are properly discarded due to window being closed.

Upvotes: 1

Related Questions