Kakaji

Reputation: 1491

Why does dataflow use additional disks?

When I see the details of my dataflow compute engine instance, I can see two categories of disks being used - (1) Boot disk and local disks, and (2) Additional disks.

I can see that the size that I specify using the diskSizeGb option determines the size of a single disk under the category 'Boot disk and local disks'. My not-so-heavy job is using 8 additional disks of 40GB each.

What are additional disks used for and is it possible to limit their size/number?

Upvotes: 1

Views: 1119

Answers (2)

Pablo

Reputation: 11031

The existing answer covers how many disks each worker gets and some details about them, but it does not answer the main question: why so many disks per worker?

WHY does Dataflow need several disks per worker?

Dataflow load-balances streaming jobs by allocating a range of keys to each persistent disk. The persistent state for each key is stored on the disk that owns its range.

A worker can become overloaded if the key ranges allocated to its persistent disks receive a very high volume of traffic. To load-balance, Dataflow can move a range from one worker to another by transferring the corresponding persistent disk to a different worker.

That is why Dataflow uses multiple disks per worker: it lets the service load-balance and autoscale by moving disks from worker to worker.
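
As a purely conceptual illustration of that idea (not Dataflow's actual implementation; all names and numbers below are made up), keys map deterministically onto a fixed pool of disks, and rebalancing reassigns whole disks between workers:

    # Conceptual sketch only: key ranges are pinned to persistent disks, and
    # load balancing moves whole disks (with their key ranges) between workers.
    NUM_DISKS = 8  # hypothetical fixed pool of persistent disks for the job

    def disk_for_key(key: str) -> int:
        """Every key deterministically belongs to exactly one disk (key range)."""
        return hash(key) % NUM_DISKS

    # Initially the disks, and therefore the key ranges, are split across two workers.
    disk_to_worker = {d: ("worker-a" if d < 4 else "worker-b") for d in range(NUM_DISKS)}

    def rebalance(hot_disk: int, target_worker: str) -> None:
        """Relieve an overloaded worker by reattaching one of its disks elsewhere."""
        disk_to_worker[hot_disk] = target_worker

    rebalance(2, "worker-b")  # the key range on disk 2 is now served by worker-b

Because the state travels with the disk, no per-key data has to be copied when a range is handed to a different worker.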

Upvotes: 0

Mangu

Reputation: 3325

Dataflow creates Compute Engine VM instances, also known as workers, for your job.

To process the input data and store temporary data, each worker may require up to 15 additional Persistent Disks.

The default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode, so 40 GB is well below the default value.

In this case, the Dataflow service will spin up more disks for your workers. If you want to keep a 1:1 ratio between workers and disks, increase the diskSizeGb field.
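
For example, here is a minimal sketch of setting a larger disk size, assuming the Apache Beam Python SDK (where the equivalent option is disk_size_gb; diskSizeGb is the Java SDK / REST API name) and placeholder project, region and bucket values:

    from apache_beam.options.pipeline_options import PipelineOptions

    # All resource names below are placeholders for illustration.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        disk_size_gb=80,  # per-worker disk size in GB; 0 means use the service default
    )

The same value can also be passed on the command line, as --disk_size_gb=80 with the Python SDK or --diskSizeGb=80 with the Java SDK.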

Upvotes: 1
