SJAndersonLA

Reputation: 87

How do you determine how many resources to provision in a Google Dataflow streaming pipeline?

I'm curious how to decide on resource provisioning for Apache Beam pipelines running on Google's Dataflow platform. I've built a streaming pipeline (Beam Java 2.0.0) that takes a PubSub JSON string, transforms it to a BQ TableRow, then routes it to the correct tables. There are also two windowed transforms within the pipeline: one with a 5-minute sliding window every minute and another with a 1-minute fixed window.
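In case it helps frame the question, here's a minimal sketch of that windowing setup in Beam Java; the subscription name and the surrounding transforms are placeholders, not our actual code:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read raw JSON strings from Pub/Sub (subscription name is a placeholder).
    PCollection<String> messages =
        p.apply(PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"));

    // 5-minute sliding window, emitted every minute.
    PCollection<String> sliding =
        messages.apply(Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(5))
                .every(Duration.standardMinutes(1))));

    // 1-minute fixed window.
    PCollection<String> fixed =
        messages.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));

    // ... JSON -> TableRow parsing and BigQueryIO writes would follow here ...

    p.run();
  }
}
```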

For some context, each incoming message is about a 1KB JSON string, and at an extreme peak the pipeline will receive 250,000 messages in one second. My sliding time window could grow to 5,000,000 table rows / minute before it closes (worst case scenario, but that's what we're planning for). Our typical peak traffic usage is about 75k messages / second. However, 90% of the time our pipeline is processing only 30 messages / second.

We're running on Dataflow with autoscaling enabled, and by default Google provisions 4 CPUs, 15 GB of memory, and 420 GB of disk * max number of workers for streaming pipelines. With 10 max workers set, we're going to be paying for 4.2 TB of disk usage a month. That seems a bit overkill, but I don't know what data I should be looking at to verify my theory.

Something I've been thinking about is instead using 2 CPUs and 7.5 GB of memory with a 20 GB SSD per worker, and setting the max number of workers to 50. Under this configuration, we'd have a minimum of 4 workers.
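For reference, here is roughly how I'd express that configuration through DataflowPipelineOptions (the project and zone in the disk type URL are placeholders and would need to match our setup):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerConfigSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    // 2 vCPUs / 7.5 GB RAM per worker.
    options.setWorkerMachineType("n1-standard-2");

    // 20 GB persistent disk per worker; pd-ssd selects SSD instead of standard disk.
    options.setDiskSizeGb(20);
    options.setWorkerDiskType(
        "compute.googleapis.com/projects/my-project/zones/us-central1-a/diskTypes/pd-ssd");

    // Allow autoscaling up to 50 workers.
    options.setMaxNumWorkers(50);

    // ... build and run the pipeline with these options ...
  }
}
```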

Summary of my spiel:
- How do you determine the CPU, RAM, and disk space you need for your streaming pipelines?
- How do you determine that a pipeline should provision SSD resources instead of standard hard drives?
- What metrics can I look at to measure the performance of my pipeline?

Upvotes: 2

Views: 2179

Answers (1)

Slava Chernyak

Reputation: 819

Since pipelines are very different, there is no general-purpose way to say how many workers and what sizes of disks to use. There are several approaches that do work well, though:

  1. Dataflow's horizontal scaling is very close to linear. This means that if you run a sampled pipeline (e.g., by sampling 10% of your input traffic; see the sketch after this list), you can very quickly estimate the resources the full pipeline will need, without overpaying. You can tell whether the pipeline is "keeping up" with the input if the system lag stays low and the data watermark continues to advance. You can then estimate the maximum number of workers that your pipeline will need at peak input rate using this strategy. Let's call this number m.
  2. Having done the above, you can then rely on autoscaling, having set the maxNumWorkers flag to a number k*m, where k effectively determines how quickly your pipeline can catch up from a backlog at peak load. E.g., at k=1 the pipeline can only keep up with peak load, so a backlog accumulated at peak load may never be drained, or may have to wait for non-peak load to drain. At k=2 the pipeline can process 2x the peak load, so it will catch up faster. Of course, this is a tradeoff between how many resources you are willing to pay for during backlog and how much catch-up latency you are willing to tolerate.
  3. Autoscaling will also ensure that the pipeline downscales during non-peak load, so that you will not be paying for all of the resources during non-peak times.
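To illustrate the sampling approach from point 1, one simple way (an assumption on my part, not a Dataflow-specific feature) to run a sampled copy of the pipeline is to drop a fixed fraction of the input right after the Pub/Sub read:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

public class SampleInput {
  // Keep ~10% of incoming messages so the sampled pipeline's resource usage
  // can be extrapolated (roughly linearly) to the full input rate.
  public static PCollection<String> sampleTenPercent(PCollection<String> messages) {
    return messages.apply("SampleTenPercent",
        Filter.by((String msg) -> ThreadLocalRandom.current().nextDouble() < 0.10));
  }
}
```

Hashing a stable message key would give a more repeatable sample than random dropping, but for estimating resource needs a random fraction is usually sufficient.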

A few other notes:

  • Streaming Dataflow tends to perform better with 4 CPU workers vs 2 CPU workers. This is because there is some per-worker overhead, and certain tuning for work parallelism is optimized for 4 CPU workers.
  • SSD use should already be enabled by default when using Dataflow, as SSDs drastically improve write throughput and lead to much better performance.

Upvotes: 4
