Kakaji

Reputation: 1491

Why does dataflow use additional disks?

When I see the details of my dataflow compute engine instance, I can see two categories of disks being used - (1) Boot disk and local disks, and (2) Additional disks.

I can see that the size that I specify using the diskSizeGb option determines the size of a single disk under the category 'Boot disk and local disks'. My not-so-heavy job is using 8 additional disks of 40GB each.

What are additional disks used for and is it possible to limit their size/number?

Upvotes: 1

Views: 1119

Answers (2)

Pablo

Reputation: 11031

The existing answer covers how many disks each worker gets and some details about them, but it does not answer the main question: why so many disks per worker?

WHY does Dataflow need several disks per worker?

Dataflow load-balances streaming jobs by allocating a range of keys to each persistent disk. The persistent state for each key is stored on the disk that owns its range.

A worker can become overloaded if the key ranges allocated to its persistent disks receive a very high volume of traffic. To load-balance, Dataflow can move a range from one worker to another by transferring the corresponding persistent disk to a different worker.

That is why Dataflow uses multiple disks per worker: it lets the service load-balance and autoscale by moving disks from worker to worker.
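
As a purely conceptual illustration of that idea (not Dataflow's actual implementation; all names and numbers below are made up), keys map deterministically onto a fixed pool of disks, and rebalancing reassigns whole disks between workers:

    # Conceptual sketch only: key ranges are pinned to persistent disks, and
    # load balancing moves whole disks (with their key ranges) between workers.
    NUM_DISKS = 8  # hypothetical fixed pool of persistent disks for the job

    def disk_for_key(key: str) -> int:
        """Every key deterministically belongs to exactly one disk (key range)."""
        return hash(key) % NUM_DISKS

    # Initially the disks, and therefore the key ranges, are split across two workers.
    disk_to_worker = {d: ("worker-a" if d < 4 else "worker-b") for d in range(NUM_DISKS)}

    def rebalance(hot_disk: int, target_worker: str) -> None:
        """Relieve an overloaded worker by reattaching one of its disks elsewhere."""
        disk_to_worker[hot_disk] = target_worker

    rebalance(2, "worker-b")  # the key range on disk 2 is now served by worker-b

Because the state travels with the disk, no per-key data has to be copied when a range is handed to a different worker.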

Upvotes: 0

Mangu

Reputation: 3325

Dataflow creates Compute Engine VM instances, also known as workers, for your job.

To process the input data and store temporary data, each worker may require up to 15 additional Persistent Disks.

The default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode, so 40 GB is well below the default value.

In this case, the Dataflow service will spin up more disks for your workers. If you want to keep a 1:1 ratio between workers and disks, increase the diskSizeGb field.
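
For example, here is a minimal sketch of setting a larger disk size, assuming the Apache Beam Python SDK (where the equivalent option is disk_size_gb; diskSizeGb is the Java SDK / REST API name) and placeholder project, region and bucket values:

    from apache_beam.options.pipeline_options import PipelineOptions

    # All resource names below are placeholders for illustration.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        disk_size_gb=80,  # per-worker disk size in GB; 0 means use the service default
    )

The same value can also be passed on the command line, as --disk_size_gb=80 with the Python SDK or --diskSizeGb=80 with the Java SDK.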

Upvotes: 1
