Reputation: 289
Overview
We have a Dataflow streaming pipeline that reads messages from a Pub/Sub subscription, transforms each dict into a dataclass, and writes the data to Postgres. I noticed that occasionally, Pub/Sub throughput drops to zero. During these periods, max memory utilization is typically at 95%+, although most of the time data still flows through steadily even with memory at 95%+.
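For context, here is a minimal sketch of the pipeline shape described above. The dataclass fields, subscription path, and the Postgres sink DoFn are placeholders for illustration, not the actual implementation:

```python
import json
from dataclasses import dataclass

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


@dataclass
class Record:
    # Hypothetical schema; the real dataclass isn't shown in the question.
    id: str
    payload: str


def to_record(msg: bytes) -> Record:
    # Pub/Sub messages arrive as bytes; decode and map the dict to the dataclass.
    d = json.loads(msg.decode("utf-8"))
    return Record(id=d["id"], payload=d["payload"])


class WriteToPostgres(beam.DoFn):
    # Placeholder sink; the real pipeline presumably does the insert here.
    def process(self, record: Record):
        ...  # insert into Postgres
        yield record


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> ReadFromPubSub(
         subscription="projects/<project>/subscriptions/<subscription>")
     | "ToDataclass" >> beam.Map(to_record)
     | "WriteToPostgres" >> beam.ParDo(WriteToPostgres()))
```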
Debugging Steps
To debug the issue, I deleted PTransforms one by one (from bottom to top), redeploying and observing after each change. The issue persisted across all setups, even when the entire pipeline was just a single ReadFromPubsub transform and nothing else (sketched below). This makes me suspect a memory leak in the library implementation.
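For concreteness, the final stripped-down version that still reproduced the problem looked roughly like this (subscription path is a placeholder):

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Bare pipeline: only the Pub/Sub source, no downstream transforms.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    p | "Read" >> ReadFromPubSub(
        subscription="projects/<project>/subscriptions/<subscription>")
```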
Observations / Comments
The falling edge of the memory graph typically corresponds to a scaling event. During scaling events from 1 to 2 workers, we've also noticed that if we initially had only worker A, we sometimes end up with workers B and C instead of A and B. Does this mean that Dataflow is aware that worker A was having memory issues and decided to shut it down?
This issue was also raised in another forum, with a response suggesting windowing as a solution. My hand-wavy understanding of what the author means: in a streaming pipeline that only uses the global window (my case), DoFns never get a chance to garbage collect because they are permanently "on" (teardown is never called?). So we need to give the pipeline a gap, and we can force this by introducing windowing, though the downstream logic becomes cumbersome: window into a 5-second window, fire repeatedly whenever there is 1 element, then group by key and flatten the results again. Essentially a no-op whose only purpose is to force windowing in; a sketch is shown below.
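A minimal sketch of that no-op windowing trick, as I understand the suggestion (the 5-second window, per-element trigger, and constant key are my interpretation, not a verified fix):

```python
import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, Repeatedly
from apache_beam.transforms.window import FixedWindows, GlobalWindows


def force_windowing(pcoll):
    """No-op window/group/flatten round trip to break out of the global window."""
    return (
        pcoll
        | "Window5s" >> beam.WindowInto(
            FixedWindows(5),
            trigger=Repeatedly(AfterCount(1)),             # fire on every element
            accumulation_mode=AccumulationMode.DISCARDING)
        | "KeyByConstant" >> beam.Map(lambda x: (None, x))
        | "Group" >> beam.GroupByKey()
        | "Ungroup" >> beam.FlatMap(lambda kv: kv[1])      # back to individual elements
        | "BackToGlobal" >> beam.WindowInto(GlobalWindows()))
```

Downstream transforms would then consume the output of force_windowing unchanged.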
Setup
Apache Beam Python SDK: 2.48.0
Python: 3.9.14
Runner: Dataflow
Upvotes: 0
Views: 470
Reputation: 565
We have identified the leak; https://github.com/apache/beam/issues/28246 has the details and workarounds.
Upvotes: 1