Reputation: 289
Overview
We have a Dataflow streaming pipeline that reads messages from a Pub/Sub subscription, transforms each dict into a dataclass, and writes the data to Postgres. I noticed that occasionally, Pub/Sub throughput drops to zero. During these periods, max memory utilization is typically at 95%+, although most of the time data still flows through steadily even with memory at 95%+.
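For context, here is a minimal sketch of the pipeline shape described above. The dataclass fields, subscription path, and the Postgres sink DoFn are placeholders for illustration, not the actual implementation:

```python
import json
from dataclasses import dataclass

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


@dataclass
class Record:
    # Hypothetical schema; the real dataclass isn't shown in the question.
    id: str
    payload: str


def to_record(msg: bytes) -> Record:
    # Pub/Sub messages arrive as bytes; decode and map the dict to the dataclass.
    d = json.loads(msg.decode("utf-8"))
    return Record(id=d["id"], payload=d["payload"])


class WriteToPostgres(beam.DoFn):
    # Placeholder sink; the real pipeline presumably does the insert here.
    def process(self, record: Record):
        ...  # insert into Postgres
        yield record


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> ReadFromPubSub(
         subscription="projects/<project>/subscriptions/<subscription>")
     | "ToDataclass" >> beam.Map(to_record)
     | "WriteToPostgres" >> beam.ParDo(WriteToPostgres()))
```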
Debugging Steps
To debug the issue, I deleted PTransforms one by one (from bottom to top), redeploying and observing after each change. The issue persisted across all setups, even when the entire pipeline was just a single ReadFromPubsub transform and nothing else (sketched below). This makes me suspect a memory leak in the library implementation.
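For concreteness, the final stripped-down version that still reproduced the problem looked roughly like this (subscription path is a placeholder):

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Bare pipeline: only the Pub/Sub source, no downstream transforms.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    p | "Read" >> ReadFromPubSub(
        subscription="projects/<project>/subscriptions/<subscription>")
```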
Observations / Comments
The falling edge of the memory graph typically corresponds to a scaling event. During scaling events from 1 to 2 workers, we've also noticed that if we initially had only worker A, we sometimes end up with workers B and C instead of A and B. Does this mean that Dataflow is aware that worker A was having memory issues and decided to shut it down?
This issue was also raised in another forum, with a response suggesting windowing as a solution. My hand-wavy understanding of what the author means: in a streaming pipeline that only uses the global window (my case), DoFns never get a chance to garbage collect because they are permanently "on" (teardown is never called?). So we need to give the pipeline a gap, and we can force this by introducing windowing, though the downstream logic becomes cumbersome: window into a 5-second window, fire repeatedly whenever there is 1 element, then group by key and flatten the results again. Essentially a no-op whose only purpose is to force windowing in; a sketch is shown below.
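A minimal sketch of that no-op windowing trick, as I understand the suggestion (the 5-second window, per-element trigger, and constant key are my interpretation, not a verified fix):

```python
import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, Repeatedly
from apache_beam.transforms.window import FixedWindows, GlobalWindows


def force_windowing(pcoll):
    """No-op window/group/flatten round trip to break out of the global window."""
    return (
        pcoll
        | "Window5s" >> beam.WindowInto(
            FixedWindows(5),
            trigger=Repeatedly(AfterCount(1)),             # fire on every element
            accumulation_mode=AccumulationMode.DISCARDING)
        | "KeyByConstant" >> beam.Map(lambda x: (None, x))
        | "Group" >> beam.GroupByKey()
        | "Ungroup" >> beam.FlatMap(lambda kv: kv[1])      # back to individual elements
        | "BackToGlobal" >> beam.WindowInto(GlobalWindows()))
```

Downstream transforms would then consume the output of force_windowing unchanged.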
Setup
Apache Beam Python SDK: 2.48.0
Python: 3.9.14
Runner: Dataflow
Upvotes: 0
Views: 470
Reputation: 565
We have identified the leak; https://github.com/apache/beam/issues/28246 has the details and workarounds.
Upvotes: 1