Reputation: 136
I am writing a simple Dataflow pipeline in Java: PubsubIO -> ConvertToTableRowDoFn -> BigQueryIO.
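For reference, here is a stripped-down sketch of what I mean (the subscription, table name, and conversion logic are placeholders, not my real code):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubsubToBigQuery {

  // Placeholder conversion: the real DoFn parses each message payload.
  static class ConvertToTableRowDoFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(@Element String payload, OutputReceiver<TableRow> out) {
      out.output(new TableRow().set("payload", payload));
    }
  }

  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true); // unbounded Pub/Sub source, so run in streaming mode

    Pipeline p = Pipeline.create(options);
    p.apply(PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"))
        .apply(ParDo.of(new ConvertToTableRowDoFn()))
        .apply(BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    p.run();
  }
}
```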
The pipeline is working -- data arrives in BigQuery as expected -- but I'm seeing OutOfMemoryErrors in the Dataflow worker logs.
One experiment I tried was slowing down the ConvertToTableRowDoFn by adding a Thread.sleep(100) per element. I expected this to make the batch sizes smaller for BigQueryIO, but to my surprise, it made the OutOfMemoryErrors more frequent!
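The throttled version of the DoFn was just a sleep at the top of the process method, along these lines (conversion still a placeholder):

```java
// Hypothetical throttled DoFn from the experiment described above.
static class ConvertToTableRowDoFn extends DoFn<String, TableRow> {
  @ProcessElement
  public void processElement(@Element String payload, OutputReceiver<TableRow> out)
      throws InterruptedException {
    Thread.sleep(100); // artificial delay, intended to shrink downstream batches
    out.output(new TableRow().set("payload", payload));
  }
}
```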
This makes me think that something in PubsubIO is reading data too quickly or doing too much buffering. Any tips for how to investigate this, or pointers on how PubsubIO does buffering in the Google Dataflow environment?
Upvotes: 1
Views: 133
Reputation: 879
Recompiled Beam with `FILE_TRIGGERING_RECORD_COUNT` set to 100000 instead of the default 500000, and we haven't seen any OOMs since.
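In case it helps anyone else: `FILE_TRIGGERING_RECORD_COUNT` is a constant in Beam's BatchLoads transform (the load-job write path behind BigQueryIO), so the fix amounts to roughly a one-line edit to the Beam source before rebuilding; the exact location and default may differ between Beam versions:

```java
// org.apache.beam.sdk.io.gcp.bigquery.BatchLoads (Beam source tree)
// Number of records buffered into a pending file shard before it is
// closed and handed off to a BigQuery load job; lowering it reduces
// how much data each worker buffers in memory at a time.
static final int FILE_TRIGGERING_RECORD_COUNT = 100000; // default was 500000
```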
Upvotes: 1