Kavitha Srinivas

Reputation: 21

Google Dataflow write to Spanner creates hundreds of workers to write a 79G file into a table

I am trying to write a 79G file into a single table in Google Spanner. The table has 1 string key column, 13 string columns, and 4 string array columns. I am using this example from the Google Cloud Dataflow Spanner examples, modified for the table I created. That threw io.grpc.StatusRuntimeException: INVALID_ARGUMENT: The transaction contains too many mutations. on the 79G file (it worked fine with a small example file). Nothing had been written to the database by the Dataflow job.

To fix that, I followed the advice suggested here and set the batch size of the write to 1KB with: SpannerIO.write().withBatchSizeBytes(1024).
That caused the Dataflow job to autoscale to >300 workers, and the job graph suggested it would take >18 hours. Is this expected behavior for a dataset this size? Is the issue the batch size? If so, what batch size is recommended to avoid the too-many-mutations exception while keeping load times low?
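For reference, the write step currently looks roughly like this (a simplified sketch: the bucket, table, and column names are placeholders, and the real job binds all 13 string and 4 string array columns):

    import com.google.cloud.spanner.Mutation;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class SpannerLoadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        p.apply("ReadFile", TextIO.read().from("gs://my-bucket/my-79g-file.txt"))
         .apply("ToMutations", MapElements
             .into(TypeDescriptor.of(Mutation.class))
             .via((String line) -> {
               // Parse one tab-separated line into a single-row mutation.
               String[] f = line.split("\t");
               return Mutation.newInsertOrUpdateBuilder("MyTable")
                   .set("Id").to(f[0])
                   .set("Col1").to(f[1])
                   // ... bind the remaining string and string array columns here
                   .build();
             }))
         .apply("WriteToSpanner", SpannerIO.write()
             .withInstanceId("my-instance")
             .withDatabaseId("my-database")
             .withBatchSizeBytes(1024));  // the 1KB batch size mentioned above

        p.run();
      }
    }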

Upvotes: 2

Views: 1084

Answers (1)

VictorGGl

Reputation: 1916

This will be fixed in an upcoming pull request. The current limit is 20k mutated column values (including indexes); if you know the size of a record, you can estimate an appropriate batch size from that. I'd also recommend capping the maximum number of workers with the --maxNumWorkers flag.
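For example, with the Dataflow runner you can pass --maxNumWorkers=20 on the command line, or set it programmatically; a small sketch (the cap of 20 is just an arbitrary example value):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Cap autoscaling so the job does not fan out to hundreds of workers.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setMaxNumWorkers(20);  // same effect as --maxNumWorkers=20
    Pipeline p = Pipeline.create(options);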

One thing you can do if you have a lot of null values is to skip null fields in the mutation builder so they are not counted as mutations. With that, I think you can safely try batch sizes of 10-100KB in this case.
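A rough sketch of skipping null fields when building each mutation (the table name, column names, and the parsed record object are hypothetical):

    import com.google.cloud.spanner.Mutation;
    import java.util.List;

    // Only bind columns that actually have values, so null fields do not
    // count toward the per-transaction limit on mutated column values.
    Mutation.WriteBuilder builder = Mutation.newInsertOrUpdateBuilder("MyTable")
        .set("Id").to(record.getId());           // key column is always set

    if (record.getName() != null) {
      builder.set("Name").to(record.getName());  // plain string column
    }
    List<String> tags = record.getTags();
    if (tags != null && !tags.isEmpty()) {
      builder.set("Tags").toStringArray(tags);   // string array column
    }
    Mutation mutation = builder.build();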

Upvotes: 1
