Shushu

Reputation: 792

RAM disks in GCP Dataflow - is it possible?

Google Compute Engine supports RAM disks - see here.
I am developing a project that will reuse existing code which manipulates local files.
For scalability, I am going to use Dataflow.
The files are in GCS, and I will send them to the Dataflow workers for manipulation.
I was thinking of improving performance by using RAM disks on the workers: copy the files from GCS directly to the RAM disk and manipulate them there.
I have failed to find any example of such a capability.
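
Conceptually, something like the hypothetical sketch below is what I have in mind (the /mnt/ramdisk mount is assumed to already exist on the worker, and the yielded path would be handed to my existing file-based code):

```python
import apache_beam as beam
from google.cloud import storage


class ProcessOnRamDisk(beam.DoFn):
    """Hypothetical DoFn: copy a GCS object onto a worker-local RAM disk,
    then hand it to the existing code that manipulates local files."""

    def process(self, gcs_uri):
        bucket_name, blob_name = gcs_uri[len('gs://'):].split('/', 1)
        # Assumes a tmpfs is already mounted at /mnt/ramdisk on the worker.
        local_path = '/mnt/ramdisk/' + blob_name.replace('/', '_')
        storage.Client().bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
        yield local_path  # the existing local-file code would operate on this path
```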

Is this a valid solution, or should I avoid this kind of "trick"?

Upvotes: 1

Views: 469

Answers (2)

Travis Webb

Reputation: 15018

While what you want to do might be technically possible by creating a setup.py with custom commands, it will not improve performance. Beam already uses as much of the workers' RAM as it can in order to perform effectively. If you are reading a file from GCS and operating on it, that file is already going to be loaded into RAM. By earmarking a big chunk of RAM for a ramdisk, you will probably make Beam run slower, not faster.
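
For reference, the setup.py "custom commands" mechanism would look roughly like the sketch below (the pattern follows the Beam juliaset example); the tmpfs mount commands are the hypothetical part, and they assume the worker environment even allows them:

```python
# setup.py - sketch of running custom commands during Dataflow worker setup.
import subprocess

import setuptools
from distutils.command.build import build as _build

# Hypothetical: create and mount a 1 GB tmpfs on each worker.
CUSTOM_COMMANDS = [
    ['mkdir', '-p', '/mnt/ramdisk'],
    ['mount', '-t', 'tmpfs', '-o', 'size=1g', 'tmpfs', '/mnt/ramdisk'],
]


class build(_build):
    # Run the custom commands as part of the normal build step on each worker.
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)


setuptools.setup(
    name='my-dataflow-job',  # placeholder package name
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)
```

But again: even if the mount succeeds, it only takes memory away from Beam.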

If you just want things to run faster, try using SSD persistent disks, increasing the number of workers, or switching to the c2 machine family.
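
Those suggestions are all just pipeline options. A minimal sketch with the Beam Python SDK (the project/region/bucket/zone values are placeholders, and the exact resource-path value expected by --worker_disk_type should be double-checked against the Dataflow docs):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                 # placeholder
    '--region=us-central1',                 # placeholder
    '--temp_location=gs://my-bucket/tmp',   # placeholder
    '--num_workers=10',                     # more workers
    '--worker_machine_type=c2-standard-4',  # c2 machine family
    # SSD persistent disks for the workers (full disk-type resource path):
    '--worker_disk_type=compute.googleapis.com/projects/my-project/zones/us-central1-a/diskTypes/pd-ssd',
])

with beam.Pipeline(options=options) as pipeline:
    pipeline | beam.Create(['gs://my-bucket/file.txt']) | beam.Map(print)  # stand-in transforms
```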

Upvotes: 1

Ricco D

Reputation: 7287

It is not possible to use a ramdisk as the disk type for the workers, since a ramdisk is set up at the OS level. The only disk types available for the workers are standard persistent disks (pd-standard) and SSD persistent disks (pd-ssd). Of these, SSD is definitely faster. You can also try adding more workers or using a faster CPU to process your data faster.

For comparison, I tried running a job with both standard and SSD disks, and it turned out to be about 13% faster with SSD than with the standard disk. But note that I only tested the quickstart from the Dataflow docs.

Using SSD: 3m 54s elapsed time

Using Standard Disk: 4m 29s elapsed time

Upvotes: 1
