Nikes MLIB

Reputation: 21

Google Cloud Dataflow BigQuery to Bigtable Transfer - Throttle Write Speed?

I have a number of Dataflow templates to copy data from BigQuery to Bigtable tables.

The largest of these is about 9 million rows, roughly 22 GB of data.

There are no complex mutations, it's just a copy.

I've noticed that while the Dataflow templates run, the Bigtable instance's CPU spikes to 100% and read/write latency degrades significantly. This happens even with only 1 worker and no customization of the threads.

I've tried tinkering with the number of workers and constraining the numberOfWorkerHarnessThreads but haven't been able to find a combination that still loads the data in a reasonable amount of time and doesn't spike the Bigtable instance.

Pipeline

        BigQueryBigtableTransferOptions options =
            PipelineOptionsFactory
                .fromArgs(args)
                .withValidation()
                .as(BigQueryBigtableTransferOptions.class);

        CloudBigtableTableConfiguration config =
            new CloudBigtableTableConfiguration.Builder()
                .withProjectId(options.getBigtableProjectId())
                .withInstanceId(options.getBigtableInstanceId())
                .withTableId(options.getBigtableTableId())
                .build();

        Pipeline p = Pipeline.create(options);

        p.apply(BigQueryIO.readTableRows().withoutValidation().fromQuery(options.getBqQuery())
                .usingStandardSql())
            .apply(ParDo.of(new Transform(options.getBigtableRowKey())))
            .apply(CloudBigtableIO.writeToTable(config));

        p.run();

The BigQuery query is just a SELECT * query, and the Transform operation just adds a Bigtable column for every BigQuery column; there is no additional logic.
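For reference, the per-column mapping the Transform performs can be sketched as self-contained Java (a hypothetical illustration, not the actual template code; the real DoFn emits an HBase Put per row rather than a plain map):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the Transform step described above: one Bigtable
// column per BigQuery column, with values stored as strings.
public class BigQueryRowToBigtable {

    // Map every BigQuery field to a (qualifier -> value) pair for Bigtable.
    static Map<String, String> toColumns(Map<String, Object> bqRow) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : bqRow.entrySet()) {
            columns.put(field.getKey(), String.valueOf(field.getValue()));
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, Object> bqRow = new LinkedHashMap<>();
        bqRow.put("user_id", 42);
        bqRow.put("name", "example");
        System.out.println(toColumns(bqRow));
    }
}
```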

Upvotes: 1

Views: 354

Answers (3)

Bora

Reputation: 131

In addition to what Nikes MLIB said, you can also use request priorities to run it as a low-priority job.

https://cloud.google.com/bigtable/docs/request-priorities

But batch write flow control would be the best place to start: https://cloud.google.com/bigtable/docs/writes#flow-control

Upvotes: 1

Nikes MLIB

Reputation: 21

Google recommended setting the mutation flow control configuration option:

https://github.com/googleapis/java-bigtable-hbase/blob/main/bigtable-client-core-parent/bigtable-hbase/src/main/java/com/google/cloud/bigtable/hbase/BigtableOptionsFactory.java

        .withConfiguration(BigtableOptionsFactory.BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL, Boolean.TRUE.toString())

We ran some tests, and even with only one node in the Bigtable cluster, CPU hovered around the recommended maximum instead of spiking to 100%, and read latencies stayed relatively flat for the entire duration of the job.
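Applied to the pipeline in the question, the option goes on the CloudBigtableTableConfiguration builder (a sketch; everything except the withConfiguration call is unchanged from the question's code):

```java
CloudBigtableTableConfiguration config =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        // Throttle bulk mutations when the Bigtable cluster is under load
        .withConfiguration(
            BigtableOptionsFactory.BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL,
            Boolean.TRUE.toString())
        .build();
```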

Upvotes: 1

John Hanley

Reputation: 81454

What is the Bigtable cluster's node scaling configuration? What are the minimum and current numbers of nodes?

Workloads are scaled by increasing the number of nodes. With sudden batch workloads, it can take up to 20 minutes under load before added nodes produce a significant improvement in cluster performance.

If the cluster is set to autoscale, set the minimum number of nodes so that the cluster does not scale down too far.

If set to manual, add more nodes at least 20 minutes before the workloads increase.
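If you manage the cluster programmatically, raising the autoscaling floor can be sketched with the java-bigtable admin client (the project, instance, and cluster IDs and the node counts below are illustrative assumptions; the same change can be made in the console or with gcloud):

```java
import com.google.cloud.bigtable.admin.v2.BigtableInstanceAdminClient;
import com.google.cloud.bigtable.admin.v2.models.ClusterAutoscalingConfig;

public class RaiseAutoscalingFloor {
    public static void main(String[] args) throws Exception {
        try (BigtableInstanceAdminClient client =
                BigtableInstanceAdminClient.create("my-project")) {
            // Keep at least 3 nodes so the cluster does not scale down too far
            // between batch loads; values here are illustrative only.
            client.updateClusterAutoscalingConfig(
                ClusterAutoscalingConfig.of("my-instance", "my-cluster")
                    .setMinNodes(3)
                    .setMaxNodes(10)
                    .setCpuUtilizationTargetPercent(70));
        }
    }
}
```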

Upvotes: 1
