Reputation: 87
I'm trying to work out how to provision resources for Apache Beam pipelines running on Google Cloud Dataflow. I've built a streaming pipeline (Beam Java 2.0.0) that takes a PubSub JSON string, transforms it into a BQ TableRow, then routes it to the correct tables. There are also two windowed transforms within the pipeline: one with a 5 minute sliding window that starts every minute, and another with a 1 minute fixed window.
For some context, each incoming message is a JSON string of about 1 KB, and at an extreme peak the pipeline will receive 250,000 messages in one second. My sliding window could grow to 5,000,000 TableRows / minute before it closes (worst case scenario, but that's what we're planning for). Our typical peak traffic is about 75k messages / second. However, 90% of the time our pipeline is processing only 30 messages / second.
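To put those rates side by side, here is a rough back-of-envelope sketch in plain Java. It uses only the figures quoted above and assumes (my simplification) that 1 KB means 1,000 bytes and that every message is the same size. The class and method names are made up for illustration.

```java
// Back-of-envelope ingest estimate from the rates quoted in the question.
// Assumption: 1 KB = 1,000 bytes and message size is constant.
final class IngestEstimate {
    static final long BYTES_PER_MESSAGE = 1_000L;

    /** Bytes arriving per second at the given message rate. */
    static long bytesPerSecond(long messagesPerSecond) {
        return messagesPerSecond * BYTES_PER_MESSAGE;
    }

    /** Messages arriving during a window of the given length at a steady rate. */
    static long messagesPerWindow(long messagesPerSecond, long windowSeconds) {
        return messagesPerSecond * windowSeconds;
    }

    public static void main(String[] args) {
        String[] labels = {"typical", "usual peak", "extreme peak"};
        long[] rates = {30L, 75_000L, 250_000L};
        for (int i = 0; i < rates.length; i++) {
            double megabytesPerSecond = bytesPerSecond(rates[i]) / 1_000_000.0;
            System.out.printf(
                "%-12s %,d msg/s = %.2f MB/s, %,d msg per 1 min, %,d msg per 5 min%n",
                labels[i], rates[i], megabytesPerSecond,
                messagesPerWindow(rates[i], 60),    // 1 minute fixed window
                messagesPerWindow(rates[i], 300));  // 5 minute sliding window
        }
    }
}
```

Sustained for a full minute, the usual peak of 75k messages / second works out to 4.5 million messages, which is the same order as the worst-case per-minute figure above. The extreme peak would be considerably more if it lasted longer than a burst.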
We're running on Dataflow with autoscaling enabled, and by default Google provisions 4 CPUs, 15 GB of memory, and 420 GB of disk per worker, multiplied by the max number of workers, for streaming pipelines. With 10 max workers set, we're going to be paying for 4.2 TB of disk every month. That seems like overkill, but I don't know what data I should be looking at to verify my theory.
Something I've been thinking about is to instead use 2 CPUs and 7.5 GB of memory with 20 GB of SSD per worker, and to set the max number of workers to 50. Under this configuration, we'd have at minimum 4 workers.
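For comparison, this is a small sketch of the upper-bound totals for the two configurations described above. It is plain arithmetic on the numbers in the question, assuming every one of the max workers is provisioned; the class and method names are hypothetical and nothing here calls the Dataflow API.

```java
// Upper-bound resource totals when all max workers are provisioned.
// Inputs are the per-worker figures quoted in the question.
final class WorkerPoolTotals {
    /** Total of a per-worker resource across the whole pool. */
    static double poolTotal(int maxWorkers, double perWorker) {
        return maxWorkers * perWorker;
    }

    static void report(String label, int maxWorkers, int cpus, double memoryGb, int diskGb) {
        System.out.printf("%-9s workers=%d cpus=%.0f memory=%.1f GB disk=%.0f GB%n",
            label, maxWorkers,
            poolTotal(maxWorkers, cpus),
            poolTotal(maxWorkers, memoryGb),
            poolTotal(maxWorkers, diskGb));
    }

    public static void main(String[] args) {
        // Default streaming setup: 4 CPUs, 15 GB, 420 GB disk, 10 max workers.
        report("default", 10, 4, 15.0, 420);
        // Proposed setup: 2 CPUs, 7.5 GB, 20 GB SSD, 50 max workers.
        report("proposed", 50, 2, 7.5, 20);
    }
}
```

At full scale the proposed setup would hold roughly a quarter of the disk (1,000 GB against 4,200 GB) but 2.5 times the CPU and memory (100 CPUs and 375 GB against 40 CPUs and 150 GB), so the cost comparison depends on how often the pipeline actually scales up.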
Summary of my spiel:
- How do you determine the CPU, RAM, and disk space you need for your streaming pipelines?
- How do you determine that a pipeline should provision SSD resources instead of standard hard drives?
- What metrics can I look at to measure the performance of my pipeline?
Upvotes: 2
Views: 2179
Reputation: 819
Since pipelines differ so much, there is no general-purpose way to say how many workers and what disk sizes to use. There are several approaches that do work well, though:
- Set the `maxNumWorkers` flag to a number `k*m`, where `m` is the number of workers needed to just keep up with peak load and `k` will effectively determine how quickly your pipeline can catch up from a backlog at peak load. E.g., at `k=1` the pipeline can only keep up with peak load, so a backlog built up at peak load may never be drained, or will have to wait for non-peak load to drain. At `k=2` the pipeline can process 2x the peak load, so it will catch up faster. Of course, this is a tradeoff between how many resources you are willing to pay for during a backlog and how much catch-up latency you are willing to tolerate.

A few other notes:
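As a rough illustration of the `k*m` tradeoff, the sketch below estimates how long a backlog takes to drain while peak load continues. It rests on assumptions that are mine, not measured Dataflow behavior: `m` is read as the worker count that just keeps up with peak load, throughput scales linearly with worker count, and the value `m = 10` and the backlog size are invented for the example.

```java
// Sketch of the k*m rule: spare capacity above peak load is what drains a backlog.
// Assumption: throughput scales linearly with the number of workers.
final class BacklogCatchUp {
    /** Worker cap for a given m (workers that just keep up with peak) and factor k. */
    static int maxNumWorkers(int m, double k) {
        return (int) Math.ceil(k * m);
    }

    /**
     * Seconds needed to drain a backlog while peak load keeps arriving.
     * Capacity is k * peakRate, so only (k - 1) * peakRate is left for the backlog.
     */
    static double drainSeconds(double backlogMessages, double peakRatePerSecond, double k) {
        if (k <= 1.0) {
            return Double.POSITIVE_INFINITY; // no spare capacity: never drains at peak
        }
        return backlogMessages / ((k - 1.0) * peakRatePerSecond);
    }

    public static void main(String[] args) {
        int m = 10;                    // hypothetical value, not from the thread
        double peakRate = 75_000.0;    // usual peak from the question, messages / second
        double backlog = 4_500_000.0;  // hypothetical: one minute of peak traffic

        for (double k : new double[] {1.0, 1.5, 2.0}) {
            System.out.printf("k=%.1f maxNumWorkers=%d drain=%.0f s%n",
                k, maxNumWorkers(m, k), drainSeconds(backlog, peakRate, k));
        }
    }
}
```

Under these assumptions, moving from `k=1.5` to `k=2` halves the catch-up time for the same backlog, at the price of a third more workers while the backlog is being drained.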
Upvotes: 4