Graham Polley

Reputation: 14781

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.

After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.

Our data/workflow:

What we've got running so far:

We've run the job using a few different worker configurations to see how it performs:

  1. 5 workers (5 vCPUs) took ~17 mins
  2. 5 workers (10 vCPUs) took ~16 mins (for this run we bumped the machine type up to "n1-standard-2" to double the cores and see if that improved performance)
  3. 50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
  4. 100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins

Would those times be in line with what you would expect for our use case and pipeline?
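For reference, the job boils down to reading text files from GCS, doing some per-element processing, and writing the rows into BigQuery. A rough sketch of that shape is below, assuming the Dataflow Java SDK 1.x API; the bucket, project, table, schema and field names are placeholders rather than our real ones, and the flags in the comment are roughly how we launch the different worker configurations listed above:

    // Sketch only: the general read-from-GCS / write-to-BigQuery shape of our job,
    // using the Dataflow Java SDK 1.x API. All names below are placeholders.
    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    import java.util.Arrays;

    public class GcsToBigQuery {

      // Turns one input line into a BigQuery row; the real parsing logic lives here.
      static class ParseLineFn extends DoFn<String, TableRow> {
        @Override
        public void processElement(ProcessContext c) {
          String[] parts = c.element().split(",");
          c.output(new TableRow().set("id", parts[0]).set("payload", parts[1]));
        }
      }

      public static void main(String[] args) {
        // Worker settings are passed on the command line, roughly:
        //   --runner=BlockingDataflowPipelineRunner --project=<project>
        //   --stagingLocation=gs://my-bucket/staging
        //   --numWorkers=5 --maxNumWorkers=100
        //   --workerMachineType=n1-standard-2 --autoscalingAlgorithm=BASIC
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("id").setType("STRING"),
            new TableFieldSchema().setName("payload").setType("STRING")));

        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.from("gs://my-bucket/input/*"))           // read from GCS
         .apply(ParDo.of(new ParseLineFn()))                          // per-element processing
         .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table") // write to BigQuery
             .withSchema(schema)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run();
      }
    }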

Upvotes: 1

Views: 1796

Answers (2)

G B

Reputation: 755

You can also write the output to files and then load them into BigQuery using the command line or the console. You'd probably save some money on instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, in my experience there is some overhead in bringing instances up and tearing them down (it can be 3-5 minutes). Do you include this time in your measurements as well?
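To illustrate the file route, here is a sketch of the change (Dataflow Java SDK 1.x; the bucket, path and format are made up): the only difference to the pipeline is the final step, writing sharded text files to GCS instead of using the BigQuery sink.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class WriteToGcsThenLoad {

      // Placeholder per-row transform; real formatting into CSV/JSON lines goes here.
      static class FormatLineFn extends DoFn<String, String> {
        @Override
        public void processElement(ProcessContext c) {
          c.output(c.element());
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.from("gs://my-bucket/input/*"))
         .apply(ParDo.of(new FormatLineFn()))
         .apply(TextIO.Write.to("gs://my-bucket/output/results")); // sharded files on GCS
        p.run();
      }
    }

Once the files are on GCS you can pick them all up in one go with something like bq load --source_format=CSV my_dataset.my_table 'gs://my-bucket/output/results*' (plus a schema if the table doesn't already exist); the dataset, table and format here are placeholders.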

Upvotes: 2

Eric Schmidt

Reputation: 1317

BigQuery has a write limit of 100,000 rows per second per table, i.e. ~6M rows per minute. At 31M rows of input, that's roughly 31M / 6M ≈ 5 minutes of just flat-out writes. When you add back the discrete processing time per element and then the synchronization time of the graph (read from GCS -> dispatch -> ...), this looks about right.

We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across them (a common model for the typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but there's nothing official to share.

Net-net, increasing the number of instances is not going to get you much more throughput right now.

Another approach, in the meantime while we work on improving the BigQuery sync, would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X tables. It might be a fun experiment. :-)
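A rough sketch of what I mean (SDK 1.x style; the glob pattern, table naming and schema are just placeholders): launch the same pipeline once per shard, each instance reading its own slice of the input files and writing to its own table.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    import java.util.Arrays;

    public class ShardedLoad {

      // Placeholder row builder; the real parsing goes here.
      static class ToRowFn extends DoFn<String, TableRow> {
        @Override
        public void processElement(ProcessContext c) {
          c.output(new TableRow().set("line", c.element()));
        }
      }

      public static void main(String[] args) {
        // One of the N shards; vary this per pipeline launch (e.g. "00" .. "09").
        final String shard = "03";
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("line").setType("STRING")));

        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());
        p.apply(TextIO.Read.from("gs://my-bucket/input/part-" + shard + "-*")) // this shard's files only
         .apply(ParDo.of(new ToRowFn()))
         .apply(BigQueryIO.Write.to("my-project:my_dataset.events_" + shard)   // this shard's table
             .withSchema(schema)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run();
      }
    }

Since the 100,000 rows/sec limit is per table, each shard gets its own write budget, and you can still query the events_* tables together afterwards with BigQuery's table wildcard functions.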

Make sense?

Upvotes: 1
