Reputation: 180
I have 3 dataflow steps in a Dataflow pipeline.
PROBLEM:
After step 1, the data is split into 10K to 20K events.
Step 2 then runs out of database connections (I have a static Hikari connection pool).
It works absolutely fine with less data. I am using an n1-standard-32 machine.
What should I do to limit the input to the next step, so that its parallelism is restricted or the events going into it are throttled?
Upvotes: 0
Views: 1303
Reputation: 839
I think the basic idea is to reduce parallelism when executing step 2. With massive parallelism, 20k events can be processed at the same time, and each of them will ask for its own connection.
Ideas include:
Stateful ParDo. Execution of a stateful ParDo is serialized per key per window, so at most one connection is needed per key and window at any given time, because only one element is processed at a time for a given key and window. If the events are assigned to a small, fixed number of keys, that number caps the concurrent connections.
One connection per bundle. You can initialize a connection in startBundle and have all elements within the same bundle share that connection (if my understanding is correct, execution within a bundle is serialized).
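To make the two ideas concrete, here is a rough sketch in plain Java rather than the Beam API. It only simulates the effect: events are hashed onto a fixed number of shard keys, and each shard is processed serially while holding one connection for the whole batch. Everything in it is an assumption for illustration: the class name `BoundedParallelismSketch`, the helpers `shardFor`, `processShard` and `run`, the shard count of 8, and the counters that stand in for a real Hikari pool. In an actual pipeline the per-shard loop would roughly correspond to a stateful DoFn keyed by the shard, or to a DoFn that opens its connection in `@StartBundle` and releases it in `@FinishBundle`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedParallelismSketch {
    // Hypothetical stand-ins for pool bookkeeping: connections open now, and the peak seen.
    static final AtomicInteger openConnections = new AtomicInteger();
    static final AtomicInteger peakConnections = new AtomicInteger();

    // Map an event to one of numShards keys; floorMod keeps the result non-negative.
    static int shardFor(String eventId, int numShards) {
        return Math.floorMod(eventId.hashCode(), numShards);
    }

    // Process one shard serially, holding a single simulated connection for the whole batch.
    static int processShard(List<String> events) {
        int now = openConnections.incrementAndGet();   // "borrow" a connection
        peakConnections.accumulateAndGet(now, Math::max);
        try {
            int written = 0;
            for (String event : events) {
                written++;                             // a real step would write 'event' to the database here
            }
            return written;
        } finally {
            openConnections.decrementAndGet();         // "return" the connection
        }
    }

    // Group events by shard key, then process the shards on a 32-thread worker pool.
    static int run(List<String> events, int numShards) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        for (String event : events) {
            shards.get(shardFor(event, numShards)).add(event);
        }

        ExecutorService workers = Executors.newFixedThreadPool(32);
        AtomicInteger total = new AtomicInteger();
        for (List<String> shard : shards) {
            workers.execute(() -> total.addAndGet(processShard(shard)));
        }
        workers.shutdown();
        try {
            workers.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for shards", e);
        }
        return total.get();
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            events.add("event-" + i);
        }
        int written = run(events, 8);
        System.out.println("events written: " + written);
        System.out.println("peak simultaneous connections: " + peakConnections.get());
    }
}
```

Even with 32 worker threads and 20,000 events, the peak number of simultaneous connections cannot exceed the number of shards, so the shard count is the knob to tune against the pool size.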
Upvotes: 1