Xinwei Liu

Reputation: 353

Credentials used by Google Cloud Dataflow

I currently have some confusion about the credentials/configuration used by Dataflow.

From my experimentation, it seems that Dataflow always uses the default gcloud configuration instead of the active one. Is that correct? For example, if my default gcloud configuration points at project A while my active configuration points at project B, my Dataflow job always submits to project A. The job also seems to ignore what is set in options.setProject(), so I am wondering when Dataflow actually uses options.getProject().

I am also wondering whether there is any way to submit a Dataflow job with a customized configuration, say, submitting multiple jobs to different projects with different credentials in the same run (without manually changing my gcloud config)?

By the way, I am running the job on the Cloud Dataflow service but submitting it from a non-GCE Cloud Services account, in case that makes a difference.

Upvotes: 5

Views: 5256

Answers (2)

Matthias Baetens

Reputation: 1553

The code I used to have Dataflow populate its workers with the service account we wanted (in addition to Lukasz's answer below):

import java.io.FileInputStream;
import java.util.Arrays;
import java.util.List;
import com.google.auth.oauth2.ServiceAccountCredentials;

// Scopes the job needs; adjust to the services your pipeline uses.
final List<String> SCOPES = Arrays.asList(
      "https://www.googleapis.com/auth/cloud-platform",
      "https://www.googleapis.com/auth/devstorage.full_control",
      "https://www.googleapis.com/auth/userinfo.email",
      "https://www.googleapis.com/auth/datastore",
      "https://www.googleapis.com/auth/pubsub");
// Submit with an explicit key; the workers run as the given service account.
options.setGcpCredential(
    ServiceAccountCredentials.fromStream(new FileInputStream("key.json")).createScoped(SCOPES));
options.setServiceAccount("[email protected]");

Upvotes: 1

Lukasz Cwik

Reputation: 1731

Google Cloud Dataflow by default uses the application default credentials library to get the credentials if they are not specified. The library currently only supports getting the credentials using the gcloud default configuration. Similarly, for the project, Google Cloud Dataflow uses the gcloud default configuration.

To run jobs under a different project, one can specify it manually on the command line (for example --project=myProject, if using PipelineOptionsFactory.fromArgs) or set the option explicitly via GcpOptions.setProject.
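For instance, a minimal sketch of both approaches (package names follow the Apache Beam SDK and may differ in older Dataflow SDK versions; myProject is a placeholder):

import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SubmitToProject {
  public static void main(String[] args) {
    // Read options such as --project=myProject from the command line...
    GcpOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(GcpOptions.class);
    // ...or override the project explicitly in code before creating the pipeline.
    options.setProject("myProject");
  }
}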

To run jobs with different credentials, one can construct a credentials object and set it explicitly via GcpOptions.setGcpCredential, or rely on one of the mechanisms the application default credentials library supports for generating the credentials automatically, which Google Cloud Dataflow is tied into. One example is to use the environment variable GOOGLE_APPLICATION_CREDENTIALS as explained here.
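Putting the two together, a sketch of submitting jobs to different projects with different credentials in the same run, which addresses the second part of the question (package names follow the Apache Beam SDK; the project IDs, key paths, and the optionsFor helper are placeholders of my own):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import com.google.auth.oauth2.GoogleCredentials;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MultiProjectSubmit {
  // Build the options for one job; project and keyPath are placeholders.
  static GcpOptions optionsFor(String project, String keyPath) throws IOException {
    GcpOptions options = PipelineOptionsFactory.as(GcpOptions.class);
    options.setProject(project);
    // An explicit credential overrides the application default credentials.
    options.setGcpCredential(
        GoogleCredentials.fromStream(new FileInputStream(keyPath))
            .createScoped(Arrays.asList("https://www.googleapis.com/auth/cloud-platform")));
    return options;
  }

  public static void main(String[] args) throws IOException {
    // Each pipeline gets its own project and key; no gcloud config changes needed.
    Pipeline jobA = Pipeline.create(optionsFor("project-a", "/path/to/keyA.json"));
    Pipeline jobB = Pipeline.create(optionsFor("project-b", "/path/to/keyB.json"));
    // Build each pipeline and run() it, with the runner set to DataflowRunner.
  }
}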

Upvotes: 6
