Colin Le Nost

Reputation: 490

How to install private repository on Dataflow Worker?

We're facing issues when deploying Dataflow jobs.

The error

We are using CustomCommands to install a private repository on the workers, but we now get an error in the worker-startup logs of our jobs:

Running command: ['pip', 'install', 'git+ssh://[email protected]/[email protected]']

Command output: b'Traceback (most recent call last):
File "/usr/local/bin/pip", line 6, in <module>
from pip._internal import main\nModuleNotFoundError: No module named \'pip\'\n' 

This code used to work, but it has been failing since our last deploy of the service on Friday.

Some context

  1. We use a GAE service with a cron job to deploy Dataflow jobs, using the Python SDK.
  2. Our jobs use code stored in a private repository.
  3. To allow the workers to pull private repositories, we use a setup.py with CustomCommands, which are run during worker startup (code example from the official repo here).
  4. The commands retrieve an encrypted SSH key from GCS, decrypt it with KMS, fetch an SSH config file that specifies the path of the key and the allowed hosts, then perform a pip install git+ssh://[email protected]/[email protected] (see commands below).

CUSTOM_COMMANDS = [
    # retrieve ssh key
    ["gsutil", "cp", "gs://{bucket_name}/encrypted_python_repo_ssh_key".format(bucket_name=credentials_bucket), "encrypted_key"],
    [
        "gcloud",
        "kms",
        "decrypt",
        "--location",
        "global",
        "--keyring",
        project,
        "--key",
        project,
        "--plaintext-file",
        "decrypted_key",
        "--ciphertext-file",
        "encrypted_key",
    ],
    ["chmod", "700", "decrypted_key"],
    
    # install git & ssh
    ["apt-get", "update"],
    ["apt-get", "install", "-y", "openssh-server"],
    ["apt-get", "install", "-y", "git"],

    # Add the ssh config, which specifies the location of the key & the host
    [
        "gsutil",
        "cp",
        "gs://{bucket_name}/ssh_config_gcloud".format(bucket_name=credentials_bucket),
        "~/.ssh/config",
    ],
    [
        "pip",
        "install",
        "git+ssh://[email protected]/[email protected]",
    ],
]
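The traceback above comes from the `/usr/local/bin/pip` launcher script, which apparently runs under an interpreter that cannot import the `pip` module. One thing that might be worth trying, offered as a sketch rather than a confirmed fix, is to call pip as a module of the interpreter that is executing `setup.py`, so that the command no longer depends on which `pip` script is first on the PATH. The names `pip_install_command` and `run_custom_command` are hypothetical; the runner only mirrors the "Running command" / "Command output" logging shown in the logs.

```python
import subprocess
import sys


def pip_install_command(requirement):
    """Build a pip command bound to the interpreter running this script.

    Assumption: the interpreter that executes setup.py has pip installed.
    """
    return [sys.executable, "-m", "pip", "install", requirement]


def run_custom_command(command_list):
    """Run one command, log it, and raise if it exits with a non-zero code."""
    print("Running command: %s" % command_list)
    process = subprocess.Popen(
        command_list,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the captured output
    )
    stdout_data, _ = process.communicate()
    print("Command output: %s" % stdout_data)
    if process.returncode != 0:
        raise RuntimeError(
            "Command %s failed: exit code: %s" % (command_list, process.returncode)
        )
    return stdout_data
```

With these helpers, the last entry of CUSTOM_COMMANDS could be replaced by the list returned from `pip_install_command(...)` with the same git+ssh requirement string. Whether this resolves the startup error depends on how the worker image is set up, which is not known here.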

What we tried

To note, the Dockerfile of the GAE service that deploys the jobs:

FROM gcr.io/google-appengine/python
RUN apt-get update && apt-get install -y openssh-server
RUN virtualenv /env -p python3.7

# Setting these environment variables is the same as running
# source /env/bin/activate.
ENV VIRTUAL_ENV /env
ENV PATH /env/bin:$PATH

# Set credentials for git, then run pip to install all
# dependencies into the virtualenv.
... specify SSH KEY, host, to allow private git repo pull 

# Add the application source code.
ADD . /app
RUN pip install -r /app/requirements.txt && python /app/setup.py install && python /app/setup.py build
CMD gunicorn -b :$PORT main:app

Any idea how to solve this issue, or is there any workaround available?

Thanks for your help!

Edit

This seems to be mostly due to the local state of the machine that deploys the job (the service, or our own computers).

After running commands such as python setup.py install or python setup.py build, I am no longer able to deploy jobs (I face the same error during worker-startup as the jobs deployed by the service). My colleague, however, is still able to deploy jobs that run (same code, same branch, except for directories excluded by .gitignore such as build, dist, ...). In his case, the CustomCommands are not run on job deployment (but the workers are still able to use the locally packaged pipeline).

Is there any way to specify a compiled package for the workers to use? I was not able to find documentation on that...

Workaround

As we were not able to pull private code from the Dataflow workers, we used the following workaround:

Commands used
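from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_options = PipelineOptions()
pipeline_options.view_as(SetupOptions).setup_file = "./setup.py"
# Ship the private package to the workers as a pre-built wheel.
pipeline_options.view_as(SetupOptions).extra_packages = ["./lib/my-package-1.0.0-py3-none-any.whl"]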
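The wheel passed to extra_packages has to exist before the job is submitted. A minimal sketch of how it could be produced on the deploying machine follows, assuming that machine already has SSH access to the private repository. The helper names and the `./lib` default are illustrative, not part of the original setup; `pip wheel`, `--no-deps` and `--wheel-dir` are standard pip options.

```python
import glob
import os
import subprocess
import sys


def wheel_build_command(requirement, wheel_dir="./lib"):
    """Return the pip command that builds a wheel for one requirement."""
    return [
        sys.executable, "-m", "pip", "wheel",
        "--no-deps",               # build only the requested package
        "--wheel-dir", wheel_dir,  # where the .whl file is written
        requirement,
    ]


def build_wheels(requirement, wheel_dir="./lib"):
    """Build the wheel and return the sorted paths of all wheels in wheel_dir."""
    os.makedirs(wheel_dir, exist_ok=True)
    subprocess.check_call(wheel_build_command(requirement, wheel_dir))
    return sorted(glob.glob(os.path.join(wheel_dir, "*.whl")))
```

The list returned by `build_wheels(...)` could then be assigned to `extra_packages`, so the workers never need credentials for the private repository.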

Upvotes: 6

Views: 1626

Answers (1)

robertwb

Reputation: 5104

For anything but trivial, public dependencies, I would recommend using custom containers and installing all the dependencies ahead of time.
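As a rough illustration of that suggestion: once an image with the private package pre-installed has been built and pushed, the job can be pointed at it through pipeline flags. The sketch below only assembles those flags. The function name and the image URL in the usage note are hypothetical, and `--sdk_container_image` is the flag name in recent Beam releases (older releases used `--worker_harness_container_image`), so check the version in use.

```python
def custom_container_flags(image, use_runner_v2=True):
    """Return pipeline flags that make workers start from a custom image.

    Assumption: `image` already contains the Beam SDK and every private
    dependency, so no pip install is needed at worker startup.
    """
    flags = ["--sdk_container_image=%s" % image]
    if use_runner_v2:
        # Some Beam versions need Runner v2 for custom containers on Dataflow.
        flags.append("--experiments=use_runner_v2")
    return flags
```

These flags would be combined with the usual Dataflow flags and passed to the pipeline, for example `PipelineOptions(custom_container_flags("gcr.io/my-project/my-beam-image:latest"))`.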

Upvotes: 2
