Matus Cimerman

Reputation: 447

Dataproc job reading from another project storage bucket

I've got project A with Storage buckets A_B1 and A_B2. Now Dataproc jobs running from project B need to have read access to buckets A_B1 and A_B2. Is that possible somehow?

Motivation: project A is the production environment, with production data stored in Storage. Project B is an "experimental" environment running experimental Spark jobs on the production data. The goal, obviously, is to separate billing for the production and experimental environments. Something similar could be done for a dev environment.

Upvotes: 2

Views: 1376

Answers (1)

Dennis Huo

Reputation: 10697

Indeed, the Dataproc cluster will be acting on behalf of a service account in project "B"; generally this will be the default GCE service account, but it can be customized to use any other service account you create inside project B.
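For illustration, a minimal sketch of attaching a custom service account at cluster-creation time; the cluster name, project, region, and service account below are placeholders, not from the question:

gcloud dataproc clusters create my-dataproc-cluster \
    --project project-b \
    --region us-central1 \
    --service-account spark-experiments@project-b.iam.gserviceaccount.com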

You can double-check the service account name by getting the details of one of the VMs in your Dataproc cluster, for example by running:

gcloud compute instances describe my-dataproc-cluster-m
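If you only want the service account's email, a --format expression trims the output (same VM name as above):

gcloud compute instances describe my-dataproc-cluster-m \
    --format 'value(serviceAccounts[].email)'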

It might look something like <project-number>-compute@developer.gserviceaccount.com. Now, in your case, if you already have data in A_B1 and A_B2, you would have to recursively edit the permissions on all the contents of those buckets to add read access for that service account, using something like:

gsutil -m acl ch -r -u <project-number>-compute@developer.gserviceaccount.com:R gs://foo-bucket

While you're at it, you might also want to change the buckets' "default object ACL" so that new objects also get that permission (a defacl sketch follows the list below). This could get tedious to do for lots of projects, so if planning ahead, you could either:

  1. Grant blanket GCS read access in project A to project B's service account by adding the service account as a project member with a "Storage Reader" role (command sketches for both options follow this list).
  2. Update the buckets in project A that might need to be shared, granting read and/or write/owner access to a new Google Group you create to manage groupings of permissions. Then you can atomically add service accounts as members of the Google Group without having to re-run a recursive update of all the objects in the bucket.
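Hedged sketches of the pieces above, reusing the placeholder service-account pattern and foo-bucket from earlier; project-a stands in for project A's ID, the Google Group name is hypothetical, and roles/storage.objectViewer is the current IAM role closest to the "Storage Reader" role mentioned in option 1:

# Default object ACL, so newly written objects are also readable:
gsutil defacl ch -u <project-number>-compute@developer.gserviceaccount.com:R gs://foo-bucket

# Option 1: project-level read grant for project B's service account:
gcloud projects add-iam-policy-binding project-a \
    --member serviceAccount:<project-number>-compute@developer.gserviceaccount.com \
    --role roles/storage.objectViewer

# Option 2: grant the group once on the bucket, then manage membership in the group:
gsutil -m acl ch -r -g my-data-readers@googlegroups.com:R gs://foo-bucket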

Upvotes: 2
