Reputation: 53
We are using the Python SDK for Apache Beam within the Google Dataflow environment. The tool is amazing, but we are concerned about the privacy implications of these jobs, since it looks like Dataflow uses public IPs to run its workers. Our question is how to keep the workers on private IPs only.
Our job template looks like this:
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, WorkerOptions, StandardOptions)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])
#GoogleCloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION
#Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25
options.view_as(StandardOptions).runner = RUNNER
### Note that we specified worker_options.subnetwork with our own subnetwork. However, once we run the job, it still looks like the workers are created on public IPs.
### The code then runs like this:
p = beam.Pipeline(options=options)
...
...
...
run = p.run()
run.wait_until_finish()
Thanks!
Upvotes: 0
Views: 1090
Reputation: 5104
You also need to pass the --no_use_public_ips option; see https://cloud.google.com/dataflow/docs/guides/specifying-networks#python
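For reference, here is a minimal sketch of how that could look in your pipeline setup. It assumes the WorkerOptions attribute use_public_ips, which is what the --no_use_public_ips flag sets in the Beam Python SDK, and reuses the NETWORK placeholder from your snippet:

# Option 1: pass the flag alongside your existing flags
options = PipelineOptions(flags=[
    '--requirements_file', './requirements.txt',
    '--no_use_public_ips',
])

# Option 2: set the corresponding attribute on WorkerOptions directly
worker_options = options.view_as(WorkerOptions)
worker_options.use_public_ips = False  # assumed attribute backing --no_use_public_ips
worker_options.subnetwork = NETWORK    # workers still need a subnetwork they can run in

Either way, per the linked guide, make sure Private Google Access is enabled on that subnetwork so workers without external IPs can still reach Google APIs and services.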
Upvotes: 2