Kavitha Srinivas

Reputation: 21

Google Dataflow write to Spanner creates hundreds of workers to write a 79G file into a table

I am trying to write a 79G file into a single table in Google Spanner. The table has 1 string key column, 13 string columns, and 4 string array columns. I am using this example from the Google Cloud Dataflow Spanner examples, modified for the table I created. That threw io.grpc.StatusRuntimeException: INVALID_ARGUMENT: The transaction contains too many mutations. on the 79G file (it worked fine with a small example file). Nothing had been written to the database by the Dataflow job.

To fix that, I followed the advice suggested here and set the batch size of the write to 1KB with: SpannerIO.write().withBatchSizeBytes(1024).
That caused the Dataflow job to autoscale to >300 workers, and the job graph suggested it would take >18 hours. Is this expected behavior for a dataset this size? Is the issue the batch size? If so, what batch size is recommended to avoid the too-many-mutations exception while keeping load times low?
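For reference, the write step currently looks roughly like this (a simplified sketch: the bucket, table, and column names are placeholders, and the real job binds all 13 string and 4 string array columns):

    import com.google.cloud.spanner.Mutation;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class SpannerLoadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        p.apply("ReadFile", TextIO.read().from("gs://my-bucket/my-79g-file.txt"))
         .apply("ToMutations", MapElements
             .into(TypeDescriptor.of(Mutation.class))
             .via((String line) -> {
               // Parse one tab-separated line into a single-row mutation.
               String[] f = line.split("\t");
               return Mutation.newInsertOrUpdateBuilder("MyTable")
                   .set("Id").to(f[0])
                   .set("Col1").to(f[1])
                   // ... bind the remaining string and string array columns here
                   .build();
             }))
         .apply("WriteToSpanner", SpannerIO.write()
             .withInstanceId("my-instance")
             .withDatabaseId("my-database")
             .withBatchSizeBytes(1024));  // the 1KB batch size mentioned above

        p.run();
      }
    }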

Upvotes: 2

Views: 1084

Answers (1)

VictorGGl

Reputation: 1916

This will be fixed in an upcoming pull request. The current limit is 20k mutated column values (including indexes); if you know the size of a record, you can estimate an appropriate batch size from that. I'd also recommend capping the maximum number of workers with the --maxNumWorkers flag.
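For example, with the Dataflow runner you can pass --maxNumWorkers=20 on the command line, or set it programmatically; a small sketch (the cap of 20 is just an arbitrary example value):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Cap autoscaling so the job does not fan out to hundreds of workers.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setMaxNumWorkers(20);  // same effect as --maxNumWorkers=20
    Pipeline p = Pipeline.create(options);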

One thing you can do if you have a lot of null values is to skip null fields in the mutation builder so they are not counted as mutations. With that, I think you can safely try batch sizes of 10-100KB in this case.
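A rough sketch of skipping null fields when building each mutation (the table name, column names, and the parsed record object are hypothetical):

    import com.google.cloud.spanner.Mutation;
    import java.util.List;

    // Only bind columns that actually have values, so null fields do not
    // count toward the per-transaction limit on mutated column values.
    Mutation.WriteBuilder builder = Mutation.newInsertOrUpdateBuilder("MyTable")
        .set("Id").to(record.getId());           // key column is always set

    if (record.getName() != null) {
      builder.set("Name").to(record.getName());  // plain string column
    }
    List<String> tags = record.getTags();
    if (tags != null && !tags.isEmpty()) {
      builder.set("Tags").toStringArray(tags);   // string array column
    }
    Mutation mutation = builder.build();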

Upvotes: 1
