Reputation: 9125
I'm trying to run a Dataflow application such that the container it runs in will be privileged, or at least will have certain capabilities (such as CAP_SYS_PTRACE).
Taking top_wikipedia_sessions.py as an example, I can run it this way with Apache Beam:
python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
--region us-central1 \
--runner DataflowRunner \
--project my_project \
--temp_location gs://my-cloud-storage-bucket/temp/ \
--output gs://my-cloud-storage-bucket/output/
If I SSH into the created instance, I can see with docker ps that the started container is not privileged and has no extra capabilities (nothing in its CapAdd). I couldn't find any way to control this from Apache Beam. I suppose I could SSH into the instances and update their Docker settings, but I wonder if there's a way around it that doesn't require manually modifying the instances Dataflow starts for me. Perhaps it's a setting I need to change in the GCP cluster settings?
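More concretely, this is roughly how I'm checking it after SSHing into the worker (a sketch; the container ID is a placeholder for whatever docker ps reports):
docker inspect --format '{{.HostConfig.Privileged}} {{.HostConfig.CapAdd}}' CONTAINER_ID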
Upvotes: 2
Views: 301
Reputation: 6023
Since your goal is to profile your job with py-spy, have you considered using Dataflow's built-in profiling capabilities?
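For example, Cloud Profiler can be enabled on the job itself instead of attaching py-spy to the worker container. A sketch based on the command in your question (the service option name should be double-checked against the current Dataflow docs):
python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
--region us-central1 \
--runner DataflowRunner \
--project my_project \
--temp_location gs://my-cloud-storage-bucket/temp/ \
--output gs://my-cloud-storage-bucket/output/ \
--dataflow_service_options=enable_google_cloud_profiler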
Upvotes: 0
Reputation: 6023
There is currently no way to directly modify how the Docker containers are launched on a Dataflow worker.
However, jobs with GPUs enabled do run their containers in privileged mode. This has cost implications but could be a way to experimentally confirm that the feature addresses your need. If you can share more about the specific use case, perhaps it will make sense to generalize this feature to non-GPU jobs.
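As a rough sketch of that experiment, attaching a GPU via the worker_accelerator service option should cause the worker container to run privileged (the accelerator type, count, and the use_runner_v2 experiment here are assumptions to verify against the Dataflow GPU documentation):
python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
--region us-central1 \
--runner DataflowRunner \
--project my_project \
--temp_location gs://my-cloud-storage-bucket/temp/ \
--output gs://my-cloud-storage-bucket/output/ \
--experiments=use_runner_v2 \
--dataflow_service_options="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"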
Upvotes: 1