DonCharlie

Reputation: 53

Private IPs to run Google Dataflow (Apache Beam) jobs

We are using the Python SDK for Apache Beam within the Google Dataflow environment. The tool is amazing, but we are concerned about the privacy of these jobs, as it looks like Dataflow uses public IPs to run its workers.

Our job template looks like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])

# Google Cloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25

options.view_as(StandardOptions).runner = RUNNER




Note that we specified worker_options.subnetwork with our own subnetwork. However, once we run the job, it still looks like it creates workers with public IPs.
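
For reference, here is a minimal sketch of the two subnetwork value formats the Dataflow docs describe; PROJECT, REGION and SUBNET are placeholders here, not our real values:

# Full resource URL form:
worker_options.subnetwork = (
    'https://www.googleapis.com/compute/v1/projects/{}/regions/{}'
    '/subnetworks/{}'.format(PROJECT, REGION, SUBNET))

# Short regional form:
# worker_options.subnetwork = 'regions/{}/subnetworks/{}'.format(REGION, SUBNET)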


The code runs like this in the end:

p = beam.Pipeline(options=options)

...
...
...

run = p.run()
run.wait_until_finish()

Thanks!

Upvotes: 0

Views: 1090

Answers (1)

robertwb

Reputation: 5104

You also need to pass the --no_use_public_ips option; see https://cloud.google.com/dataflow/docs/guides/specifying-networks#python
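
As a minimal sketch wired into the template from the question, you can pass it as a flag or set it programmatically (use_public_ips is the WorkerOptions attribute that --no_use_public_ips sets in the Beam Python SDK):

# Option 1: pass the flag when building the pipeline options.
options = PipelineOptions(flags=[
    '--requirements_file', './requirements.txt',
    '--no_use_public_ips',
])

# Option 2: set the attribute directly on WorkerOptions.
worker_options = options.view_as(WorkerOptions)
worker_options.use_public_ips = False

Note that workers without public IPs can only reach Google APIs and services through your VPC, so per the linked docs the subnetwork typically needs Private Google Access enabled.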

Upvotes: 2
