leech

Reputation: 289

Memory leak in Apache Beam Python ReadFromPubsub IO

Overview

We have a Dataflow streaming pipeline that reads messages from a Pub/Sub subscription, converts each message dict to a dataclass, and writes the data to Postgres. I noticed that Pub/Sub throughput occasionally drops to zero. When this happens, max memory utilization is typically above 95%, although most of the time data keeps flowing steadily even with memory at that level.


Debugging Steps

To debug the issue, I deleted PTransforms one by one (from bottom to top), redeploying and observing after each removal. The issue persisted in every configuration, even when the entire pipeline was just a single ReadFromPubsub transform and nothing else. This makes me suspect a memory leak in the library implementation.
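For reference, the stripped-down pipeline from the last step looks roughly like the sketch below. This is not the exact code from our job: the helper names (`subscription_path`, `dataflow_args`, `run`) and every project, region, and bucket value are placeholders I made up for illustration. Only `ReadFromPubsub`, `PipelineOptions`, and the standard Dataflow flags come from Beam itself.

```python
from typing import List


def subscription_path(project: str, subscription: str) -> str:
    """Build the fully qualified Pub/Sub subscription path (hypothetical helper)."""
    return f"projects/{project}/subscriptions/{subscription}"


def dataflow_args(project: str, region: str, temp_location: str) -> List[str]:
    """Assemble Dataflow streaming flags (hypothetical helper, placeholder values)."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        "--streaming",
    ]


def run(subscription: str, pipeline_args: List[str]) -> None:
    """Run a pipeline whose only transform is ReadFromPubsub."""
    # Imported inside the function so the helpers above can be used
    # without Apache Beam installed.
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubsub
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        # Read messages and discard them; no downstream transforms.
        _ = pipeline | "ReadFromPubsub" >> ReadFromPubsub(subscription=subscription)
```

Calling `run(subscription_path("my-project", "my-subscription"), dataflow_args("my-project", "us-central1", "gs://my-bucket/tmp"))` would submit the job. If memory still climbs with only this transform deployed, the downstream transforms can be ruled out as the cause.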

Observations / Comments

Setup

Upvotes: 0

Views: 470

Answers (1)

Valentyn

Reputation: 565

We have identified the leak. See https://github.com/apache/beam/issues/28246 for details and workarounds.

Upvotes: 1
