Jong

Reputation: 9125

Privileged mode / capabilities in a Dataflow container

I'm trying to run a Dataflow application such that the container it runs in will be privileged, or at least will have certain capabilities (such as CAP_SYS_PTRACE).

Taking the top_wikipedia_sessions.py as an example, I can run it this way with Apache Beam:

python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
  --region us-central1 \
  --runner DataflowRunner \
  --project my_project \
  --temp_location gs://my-cloud-storage-bucket/temp/ \
  --output gs://my-cloud-storage-bucket/output/

If I SSH into the created instance, I can see with docker ps that the started container is not privileged and has no added capabilities (its CapAdd is empty). I couldn't find any way in Apache Beam to control this. I suppose I could SSH into the instances and update their Docker settings, but I'd prefer an approach that doesn't require manually modifying the instances Dataflow starts for me. Perhaps it's a setting I need to change at the cluster level in GCP?
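For reference, this is roughly how I checked on the worker VM (the container ID will differ; the HostConfig.Privileged and HostConfig.CapAdd fields are what docker inspect reports for these settings):

docker ps
docker inspect --format '{{.HostConfig.Privileged}} {{.HostConfig.CapAdd}}' <container-id>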

Upvotes: 2

Views: 301

Answers (2)

Kenn Knowles

Reputation: 6023

Since your goal is to profile your job with py-spy, have you considered using Dataflow's built-in profiling capabilities?
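As a sketch, the Cloud Profiler integration can be turned on with a service option at launch, roughly like this (check the Dataflow documentation for the exact option names supported by your SDK version):

python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
  --region us-central1 \
  --runner DataflowRunner \
  --project my_project \
  --temp_location gs://my-cloud-storage-bucket/temp/ \
  --output gs://my-cloud-storage-bucket/output/ \
  --dataflow_service_options=enable_google_cloud_profiler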

Upvotes: 0

Kenn Knowles

Reputation: 6023

There is currently no way to directly modify how the Docker containers are launched on a Dataflow worker.

However, jobs with GPUs enabled do run their containers in privileged mode. This has cost implications, but it could be a way to experimentally confirm that privileged mode addresses your need. If you can share more about the specific use case, perhaps it will make sense to generalize this behavior to non-GPU jobs.
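As a sketch of that experiment (the accelerator type, count, and regional availability are assumptions you would need to adapt to your project; GPU jobs also typically require Runner v2), attaching a GPU to the workers looks roughly like this:

python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
  --region us-central1 \
  --runner DataflowRunner \
  --project my_project \
  --temp_location gs://my-cloud-storage-bucket/temp/ \
  --output gs://my-cloud-storage-bucket/output/ \
  --experiments=use_runner_v2 \
  --dataflow_service_options="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"

Once the job is running, SSHing into a worker and repeating the docker inspect check from the question should show Privileged reported as true.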

Upvotes: 1
