Matus Cimerman

Reputation: 447

Google Dataproc in-cluster encryption

We're working on becoming GDPR compliant, and one of the core issues is data encryption. I know data is encrypted in transit when it moves between nodes in Google Cloud Platform. What about encryption within a cluster, e.g. during shuffling, when using Google Dataproc? Also, is data encrypted when Spark uses its tmp dir internally (by default it writes plain-text files there)?

Upvotes: 3

Views: 382

Answers (1)

Karthik Palaniappan

Reputation: 1383

Dataproc is built on GCE VMs, so the same security applies.

All data on the disks of GCE VMs (persistent disks or local SSDs) is encrypted at rest: https://cloud.google.com/compute/docs/disks/. So the files Spark writes to its tmp dir are indeed encrypted on disk.

Network communication that leaves Google data centers (e.g. cross-region traffic) is encrypted. Also, Google API access is encrypted. However, node-to-node communication within a datacenter (likely all in-cluster Dataproc traffic) is not encrypted. You can read more here: https://cloud.google.com/security/encryption-in-transit/.
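If you do need in-cluster traffic and shuffle spill files encrypted at the application level, Spark has its own settings for that, independent of what GCP does. The snippet below is only a sketch, not something from the Dataproc docs: it assumes the standard Spark properties `spark.authenticate`, `spark.network.crypto.enabled` and `spark.io.encryption.enabled` fit your Spark version, and that they are passed through the `spark:` prefix of the `--properties` flag on `gcloud dataproc clusters create`. The helper name `to_dataproc_properties` is made up for illustration.

```python
# Spark properties that enable Spark's own encryption layers.
# Assumption: these suit your Spark version; check the Spark security docs.
SPARK_ENCRYPTION_PROPS = {
    # Shared-secret authentication between Spark processes
    # (a prerequisite for RPC encryption).
    "spark.authenticate": "true",
    # AES-based encryption of RPC traffic, including shuffle block transfers.
    "spark.network.crypto.enabled": "true",
    # Encryption of data Spark writes to local disk (shuffle files, spills).
    "spark.io.encryption.enabled": "true",
}


def to_dataproc_properties(spark_props):
    """Format Spark properties as a value for the --properties flag.

    Each key gets the `spark:` prefix, which targets spark-defaults.conf.
    Hypothetical helper, for illustration only.
    """
    return ",".join(f"spark:{key}={value}" for key, value in sorted(spark_props.items()))


if __name__ == "__main__":
    print("--properties=" + to_dataproc_properties(SPARK_ENCRYPTION_PROPS))
```

The resulting string can be appended to a `gcloud dataproc clusters create` command. Expect some CPU overhead from the extra encryption, so it is worth measuring on a representative job first.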

That being said, in-cluster communication is effectively isolated: node-to-node traffic travels over internal IPs on your own VPC network, which is separate from other customers' networks. Dataproc has guidance on how to configure firewall rules.

You can also use Dataproc private IP clusters to avoid having external IP addresses on the VMs.
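As a rough sketch of what that looks like, the snippet below assembles a `gcloud dataproc clusters create` command with the `--no-address` flag, which creates the VMs without external IPs. The function name and the cluster, region and subnet values are hypothetical placeholders, and it assumes the subnet has Private Google Access enabled so the nodes can still reach Google APIs.

```python
import shlex


def private_cluster_command(cluster_name, region, subnet):
    """Build a gcloud command for a Dataproc cluster with internal IPs only.

    Hypothetical helper; all argument values are supplied by the caller.
    """
    args = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--subnet={subnet}",
        # No external IP addresses on the cluster VMs.
        "--no-address",
    ]
    # shlex.join quotes each argument safely for a POSIX shell.
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder values, not taken from any real project.
    print(private_cluster_command("example-cluster", "europe-west1", "example-subnet"))
```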

Here is the doc on Google Cloud GDPR compliance: https://www.google.com/cloud/security/gdpr/.

Upvotes: 6
