Reputation: 15975
I have a Dataflow job whose input is a large read from a database. I would like to split this query up and have it executed from multiple hosts when the job starts. As far as I can tell, BoundedSource has no way of directly controlling the input split. The closest it has is splitIntoBundles, which basically means I have to start a very expensive read and hope Dataflow cancels it and uses my defined bundle split instead. This seems pretty crazy, so I'm hoping there is a better way of predefining an input split that can be run on any remote workers.
Upvotes: 0
Views: 613
Reputation: 15975
After much research, I found no way to control the split parallelism of a single reader. My solution was to create multiple readers, have each reader read into its own PCollection, and then flatten those PCollections into a single PCollection.
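One way to decide what each of those readers should read is to cut the key space into disjoint ranges up front, one range per reader. The sketch below shows only that partitioning step in plain Java, with no Dataflow dependency. It assumes the table has a numeric key that is spread roughly evenly; the class name QueryShards, the table name events, and the column name id are made up for illustration and are not from my actual job.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryShards {
    /** Half-open key range [lo, hi) that a single reader would be responsible for. */
    static final class Range {
        final long lo;
        final long hi;

        Range(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }

        /** Builds the sub-query for this range (table and column names are hypothetical). */
        String toSql(String table, String keyColumn) {
            return "SELECT * FROM " + table
                    + " WHERE " + keyColumn + " >= " + lo
                    + " AND " + keyColumn + " < " + hi;
        }
    }

    /**
     * Splits [minId, maxIdExclusive) into at most `shards` contiguous,
     * non-overlapping ranges whose sizes differ by at most one key.
     */
    static List<Range> split(long minId, long maxIdExclusive, int shards) {
        if (shards <= 0 || maxIdExclusive < minId) {
            throw new IllegalArgumentException("need shards > 0 and maxIdExclusive >= minId");
        }
        List<Range> out = new ArrayList<>();
        long total = maxIdExclusive - minId;
        long base = total / shards;   // minimum number of keys per range
        long extra = total % shards;  // the first `extra` ranges get one more key
        long lo = minId;
        for (int i = 0; i < shards; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                continue; // fewer keys than shards: skip empty ranges
            }
            out.add(new Range(lo, lo + size));
            lo += size;
        }
        return out;
    }

    public static void main(String[] args) {
        // Each printed query would back one reader; the resulting
        // PCollections would then be merged with a flatten step.
        for (Range r : split(0, 1000, 4)) {
            System.out.println(r.toSql("events", "id"));
        }
    }
}
```

In the pipeline itself, each range would become its own read step, and the per-reader PCollections would be added to a PCollectionList and merged with Flatten.pCollections(). If the keys are heavily skewed, equal-width ranges like these will produce uneven readers, so the range boundaries would need to come from the data instead.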
Upvotes: 1