Jong

Reputation: 9125

Privileged mode / capabilities in a Dataflow container

I'm trying to run a Dataflow application such that the container it runs in will be privileged, or at least will have certain capabilities (such as CAP_SYS_PTRACE).

Taking the top_wikipedia_sessions.py as an example, I can run it this way with Apache Beam:

python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
  --region us-central1 \
  --runner DataflowRunner \
  --project my_project \
  --temp_location gs://my-cloud-storage-bucket/temp/ \
  --output gs://my-cloud-storage-bucket/output/

If I SSH into the created instance, I can see with docker ps that the started container is not privileged and has no added capabilities (its CapAdd is empty). I couldn't find any way in Apache Beam to control this. I suppose I could SSH into the instances and update their Docker settings, but I'd prefer an approach that doesn't require manually modifying the instances Dataflow starts for me. Perhaps it's a setting I need to change at the cluster level in GCP?
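For reference, this is roughly how I checked on the worker VM (the container ID will differ; the HostConfig.Privileged and HostConfig.CapAdd fields are what docker inspect reports for these settings):

docker ps
docker inspect --format '{{.HostConfig.Privileged}} {{.HostConfig.CapAdd}}' <container-id>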

Upvotes: 2

Views: 301

Answers (2)

Kenn Knowles

Reputation: 6023

Since your goal is to profile your job with py-spy, have you considered using Dataflow's built-in profiling capabilities?
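As a sketch, the Cloud Profiler integration can be turned on with a service option at launch, roughly like this (check the Dataflow documentation for the exact option names supported by your SDK version):

python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
  --region us-central1 \
  --runner DataflowRunner \
  --project my_project \
  --temp_location gs://my-cloud-storage-bucket/temp/ \
  --output gs://my-cloud-storage-bucket/output/ \
  --dataflow_service_options=enable_google_cloud_profiler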

Upvotes: 0

Kenn Knowles

Reputation: 6023

There is currently no way to directly modify how the Docker containers are launched on a Dataflow worker.

However, jobs with GPUs enabled do run their containers in privileged mode. This has cost implications, but it could be a way to experimentally confirm that privileged mode addresses your need. If you can share more about the specific use case, perhaps it will make sense to generalize this behavior to non-GPU jobs.
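As a sketch of that experiment (the accelerator type, count, and regional availability are assumptions you would need to adapt to your project; GPU jobs also typically require Runner v2), attaching a GPU to the workers looks roughly like this:

python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
  --region us-central1 \
  --runner DataflowRunner \
  --project my_project \
  --temp_location gs://my-cloud-storage-bucket/temp/ \
  --output gs://my-cloud-storage-bucket/output/ \
  --experiments=use_runner_v2 \
  --dataflow_service_options="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"

Once the job is running, SSHing into a worker and repeating the docker inspect check from the question should show Privileged reported as true.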

Upvotes: 1
