Chris Heath

Reputation: 136

Java OutOfMemoryError using PubsubIO

I am writing a simple Dataflow pipeline in Java: PubsubIO -> ConvertToTableRowDoFn -> BigQueryIO
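For concreteness, a minimal sketch of that three-stage pipeline, assuming the Beam Java SDK; the topic, table names, and the body of ConvertToTableRowDoFn are placeholders, not the actual code:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubsubToBigQuery {

  // Placeholder conversion: wraps each Pub/Sub message payload in a TableRow.
  static class ConvertToTableRowDoFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(@Element String msg, OutputReceiver<TableRow> out) {
      out.output(new TableRow().set("payload", msg));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
     .apply(ParDo.of(new ConvertToTableRowDoFn()))
     .apply(BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table"));
    p.run();
  }
}
```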

The pipeline is working -- data arrives in BigQuery as expected -- but I'm seeing OutOfMemoryErrors in the Dataflow worker logs.

One experiment I tried was slowing down ConvertToTableRowDoFn by adding Thread.sleep(100) per element. I expected this to shrink the batch sizes reaching BigQueryIO, but to my surprise it made the OutOfMemoryErrors more frequent!

This makes me think that something in PubsubIO is reading data too quickly or doing too much buffering. Any tips for how to investigate this, or pointers on how PubsubIO does buffering in the Google Dataflow environment?

Upvotes: 1

Views: 133

Answers (1)

Temu

Reputation: 879

Recompiled Beam with FILE_TRIGGERING_RECORD_COUNT = 100000 instead of 500000, and we haven't seen any OOMs since.
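If you'd rather not maintain a patched Beam build, recent Beam releases expose related batching knobs on BigQueryIO itself when using file loads. A hedged sketch (the table name and the specific frequency/shard values are illustrative assumptions, not tested settings for this workload):

```java
// Configuration fragment for the write stage of the pipeline, assuming a
// Beam version that supports FILE_LOADS from an unbounded source.
.apply(BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")
    // Batch via load jobs instead of streaming inserts.
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
    // Flush buffered records to files on a timer rather than only by count.
    .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(5))
    // More shards means smaller per-shard buffers on each worker.
    .withNumFileShards(10));
```

Smaller, more frequent flushes reduce how much data each worker buffers before a load job, which targets the same memory pressure as lowering FILE_TRIGGERING_RECORD_COUNT.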

Upvotes: 1
