Max

Reputation: 15975

How to control bounded source split?

I have a Dataflow job whose input is a large read from a database. I would like to split this query up and have the pieces executed on multiple hosts when the job starts. As far as I can tell, BoundedSource offers no way to control the input split directly. The closest thing it has is splitIntoBundles, which as far as I understand means I have to start a very expensive read and hope Dataflow cancels it and uses my defined bundle split instead. This seems pretty crazy, so I'm hoping there is a better way to predefine an input split that can run on any remote worker.

Upvotes: 0

Views: 613

Answers (1)

Max

Reputation: 15975

After much research, I found no way to control the split parallelism of a single reader. My solution was to create multiple readers, have each reader read its share of the data into its own PCollection, and then flatten those PCollections into a single PCollection.
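
To make the "multiple readers" idea concrete, here is a minimal sketch of the part that decides what each reader reads. It is not from my actual job. It assumes the table has a numeric key column whose minimum and maximum are known up front, and the class name `QuerySplitter`, the method `splitQueries`, and the table and column names are all hypothetical. It only builds the per-reader query strings and uses nothing beyond the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

public class QuerySplitter {

    /**
     * Splits the inclusive key range [minId, maxId] into at most `parts`
     * contiguous, non-overlapping sub-ranges and returns one query per
     * sub-range. Assumes (maxId - minId + 1) fits in a long.
     */
    public static List<String> splitQueries(
            String table, String idColumn, long minId, long maxId, int parts) {
        if (parts < 1 || maxId < minId) {
            throw new IllegalArgumentException("need parts >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long n = Math.min(parts, total);   // never create empty ranges
        long base = total / n;             // minimum keys per range
        long remainder = total % n;        // the first `remainder` ranges get one extra key

        List<String> queries = new ArrayList<>();
        long start = minId;
        for (long i = 0; i < n; i++) {
            long size = base + (i < remainder ? 1 : 0);
            long end = start + size - 1;
            queries.add(String.format(
                    "SELECT * FROM %s WHERE %s BETWEEN %d AND %d",
                    table, idColumn, start, end));
            start = end + 1;
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical table and column names, for illustration only.
        for (String q : splitQueries("events", "id", 1, 10, 3)) {
            System.out.println(q);
        }
    }
}
```

Each returned query would then back its own read step, and the resulting PCollections can be collected in a PCollectionList and merged with Flatten.pCollections(). Splitting on an evenly spaced key range only balances the work if the keys are roughly uniformly distributed; with skewed keys some readers will end up doing much more than others.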

Upvotes: 1
