Reputation: 136
I am writing a simple Dataflow pipeline in Java: PubsubIO -> ConvertToTableRowDoFn -> BigQueryIO.
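For reference, here is a stripped-down sketch of what I mean (the subscription, table name, and conversion logic are placeholders, not my real code):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubsubToBigQuery {

  // Placeholder conversion: the real DoFn parses each message payload.
  static class ConvertToTableRowDoFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(@Element String payload, OutputReceiver<TableRow> out) {
      out.output(new TableRow().set("payload", payload));
    }
  }

  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true); // unbounded Pub/Sub source, so run in streaming mode

    Pipeline p = Pipeline.create(options);
    p.apply(PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"))
        .apply(ParDo.of(new ConvertToTableRowDoFn()))
        .apply(BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    p.run();
  }
}
```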
The pipeline is working -- data arrives in BigQuery as expected -- but I'm seeing OutOfMemoryErrors in the Dataflow worker logs.
One experiment I tried was slowing down the ConvertToTableRowDoFn by adding a Thread.sleep(100) per element. I expected this to make the batch sizes smaller for BigQueryIO, but to my surprise, it made the OutOfMemoryErrors more frequent!
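The throttled version of the DoFn was just a sleep at the top of the process method, along these lines (conversion still a placeholder):

```java
// Hypothetical throttled DoFn from the experiment described above.
static class ConvertToTableRowDoFn extends DoFn<String, TableRow> {
  @ProcessElement
  public void processElement(@Element String payload, OutputReceiver<TableRow> out)
      throws InterruptedException {
    Thread.sleep(100); // artificial delay, intended to shrink downstream batches
    out.output(new TableRow().set("payload", payload));
  }
}
```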
This makes me think that something in PubsubIO is reading data too quickly or doing too much buffering. Any tips for how to investigate this, or pointers on how PubsubIO does buffering in the Google Dataflow environment?
Upvotes: 1
Views: 133
Reputation: 879
Recompiled Beam with `FILE_TRIGGERING_RECORD_COUNT` set to 100000 instead of the default 500000, and we haven't seen any OOMs since.
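In case it helps anyone else: `FILE_TRIGGERING_RECORD_COUNT` is a constant in Beam's BatchLoads transform (the load-job write path behind BigQueryIO), so the fix amounts to roughly a one-line edit to the Beam source before rebuilding; the exact location and default may differ between Beam versions:

```java
// org.apache.beam.sdk.io.gcp.bigquery.BatchLoads (Beam source tree)
// Number of records buffered into a pending file shard before it is
// closed and handed off to a BigQuery load job; lowering it reduces
// how much data each worker buffers in memory at a time.
static final int FILE_TRIGGERING_RECORD_COUNT = 100000; // default was 500000
```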
Upvotes: 1