Yauza

Reputation: 180

Dataflow: controlling high fan-out between steps

I have 3 steps in a Dataflow pipeline (sketched below):

  1. Reads from Pub/Sub, saves the message to a table, and splits it into multiple events (emitted via the context output).
  2. For each split event, queries the database and decorates the event with additional data.
  3. Publishes to another Pub/Sub topic for further processing.
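
For reference, the pipeline shape is roughly this (a minimal Beam Java sketch; the topic names, the event type, and the split/enrich bodies are placeholders, not the real code):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class FanoutPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/input"))
        // Step 1: persist the raw message, then fan out into many events.
        .apply("SaveAndSplit", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void process(@Element String msg, OutputReceiver<String> out) {
            // saveToTable(msg); // placeholder for the table write
            for (String event : msg.split(",")) {
              out.output(event); // 10K-20K outputs per input in practice
            }
          }
        }))
        // Step 2: enrich each event with a DB lookup (where the pool runs dry).
        .apply("Enrich", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void process(@Element String event, OutputReceiver<String> out) {
            out.output(event + "|decorated"); // placeholder for the DB query
          }
        }))
        // Step 3: publish for further processing.
        .apply(PubsubIO.writeStrings().to("projects/my-project/topics/output"));

    p.run();
  }
}
```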

PROBLEM:
After step 1, it splits into 10K to 20K events.

Now in step 2 it runs out of database connections. (I have a static HikariCP connection pool.)

It works absolutely fine with less data. I am using an n1-standard-32 machine.

What should I do to limit the input to the next step, so that parallelism is restricted or events are throttled before they reach it?

Upvotes: 0

Views: 1303

Answers (1)

Rui Wang

Reputation: 839

I think the basic idea is to reduce the parallelism when executing step 2 (with massive parallelism you would need 20k connections for 20k events, because all 20k events are processed in parallel).

Ideas include:

  1. A stateful ParDo's execution is serialized per key per window, which means only one connection is needed per key, because only one element is processed at a given time for a given key and window (see the first sketch after this list).

  2. One connection per bundle. You can initialize a connection in @StartBundle and have all elements within the same bundle reuse that connection (if my understanding is correct, execution within a bundle is serialized; see the second sketch after this list).
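
A sketch of idea 1, assuming a String event type: assign each event to one of N synthetic keys, then enrich inside a stateful DoFn. Declaring any state makes the ParDo stateful, so Beam serializes execution per key per window, and at most N elements (hence at most N connections) are in flight at once. The shard count of 16 and the enrichment body are placeholders:

```java
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

class SerializedEnrich {
  static PCollection<String> expand(PCollection<String> events) {
    final int maxParallelism = 16; // cap on concurrent elements/connections
    return events
        .apply("AssignShard", WithKeys.of(
                (String e) -> Math.floorMod(e.hashCode(), maxParallelism))
            .withKeyType(TypeDescriptors.integers()))
        .apply("Enrich", ParDo.of(new DoFn<KV<Integer, String>, String>() {
          // Declaring state is what makes this DoFn stateful, forcing
          // serialized execution per key per window.
          @StateId("marker")
          private final StateSpec<ValueState<Boolean>> marker = StateSpecs.value();

          @ProcessElement
          public void process(@Element KV<Integer, String> kv,
                              @StateId("marker") ValueState<Boolean> marker,
                              OutputReceiver<String> out) {
            // Only one element per key is in flight here, so a single
            // pooled connection per shard suffices.
            out.output(kv.getValue() + "|decorated"); // placeholder DB lookup
          }
        }));
  }
}
```

Picking the shard count trades throughput against connection usage: it should be no larger than the number of connections your database can sustain.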
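And a sketch of idea 2: borrow one connection from a per-worker Hikari pool in @StartBundle, reuse it for every element of the bundle, and return it in @FinishBundle. The JDBC URL, pool size, and query are placeholders:

```java
import java.sql.Connection;
import java.sql.SQLException;
import com.zaxxer.hikari.HikariDataSource;
import org.apache.beam.sdk.transforms.DoFn;

class EnrichFn extends DoFn<String, String> {
  // One pool per worker JVM; static fields are not serialized with the DoFn.
  private static final HikariDataSource POOL = createPool();
  private transient Connection conn;

  private static HikariDataSource createPool() {
    HikariDataSource ds = new HikariDataSource();
    ds.setJdbcUrl("jdbc:postgresql://db-host/mydb"); // placeholder URL
    ds.setMaximumPoolSize(8); // small fixed cap per worker
    return ds;
  }

  @StartBundle
  public void startBundle() throws SQLException {
    // One connection serves the whole bundle; elements of a bundle are
    // processed sequentially, so sharing it is safe.
    conn = POOL.getConnection();
  }

  @ProcessElement
  public void process(@Element String event, OutputReceiver<String> out) {
    // Reuse the bundle's connection for the lookup (query elided).
    out.output(event + "|decorated");
  }

  @FinishBundle
  public void finishBundle() throws SQLException {
    if (conn != null) {
      conn.close(); // returns the connection to the Hikari pool
      conn = null;
    }
  }
}
```

With this, concurrent connection usage is bounded by the number of bundles executing at once per worker rather than by the number of elements.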

Upvotes: 1
