Cloud

Reputation: 139

Dataproc initialization script fails with "pip command not found" when using multiple initialization scripts

The following is the command I used to create the Dataproc cluster. It uses two initialization scripts: (1) jupyter.sh and (2) my_initialize.sh.

gcloud dataproc clusters create dproc \
    --subnet default --zone us-west1-a --project myproject \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh,gs://mydataproc/my_initialize.sh \
    --master-machine-type n1-standard-8 --master-boot-disk-size 40 \
    --worker-machine-type n1-standard-8 --worker-boot-disk-size 40 --num-workers 4

The following is the content of my_initialize.sh:

#!/usr/bin/env bash
pip install --upgrade google-cloud-bigquery

I believe pip is already installed by the time jupyter.sh has finished running.

For some reason, cluster creation fails with the error "line 2: pip command not found".

Upvotes: 1

Views: 558

Answers (2)

tix

Reputation: 2158

I believe the issue is that an init action does not see changes made to the environment by previous init actions. We will be rolling out a fix for this in the next few weeks, so sourcing profile.d should not be necessary after that. The fix will be announced in the release notes.

In the meantime, as @Karthik Palaniappan mentions, just use pip by its full path, /opt/conda/bin/pip.

Finally, on the Dataproc 1.3 image you can use the Anaconda+Jupyter Optional Components. Using components instead of init actions will cut down on overall cluster boot time.

Upvotes: 1

Karthik Palaniappan

Reputation: 1383

Yeah, this is because neither pip nor anything else in /opt/conda/bin/ is in $PATH for your second init action. In fact, they don't end up on the path for the root user, even if you run sudo su root: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/issues/246.

If you're interested in fixing that issue, I'd be happy to accept a PR. Just as a starting point: bootstrap-conda.sh sets up /etc/profile.d/conda.sh, and other scripts source that file explicitly.

Unless there's a simple way to change $PATH systemwide, I think your best bet is to explicitly source /etc/profile.d/conda.sh as well.

Alternatively, run pip with its absolute path, e.g. /opt/conda/bin/pip install ....
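
As a rough sketch of how the two workarounds could be combined in my_initialize.sh: the helpers below source /etc/profile.d/conda.sh when it is readable, then fall back to the absolute path /opt/conda/bin/pip. The two paths are the ones mentioned in this thread; the function names resolve_pip and pip_install are hypothetical, and the sketch assumes the conda/Jupyter init action has already run on the node.

```shell
#!/usr/bin/env bash
# Sketch of helpers for a second init action such as my_initialize.sh.
# The two paths come from this thread; the function names are hypothetical.

CONDA_PROFILE="/etc/profile.d/conda.sh"   # set up by bootstrap-conda.sh
CONDA_PIP="/opt/conda/bin/pip"            # absolute-path fallback

# Print the path of a usable pip, or return non-zero if none is found.
resolve_pip() {
  # PATH changes from earlier init actions are not inherited, so load them.
  if [[ -r "${CONDA_PROFILE}" ]]; then
    # shellcheck disable=SC1090
    source "${CONDA_PROFILE}"
  fi
  if command -v pip >/dev/null 2>&1; then
    command -v pip
  elif [[ -x "${CONDA_PIP}" ]]; then
    echo "${CONDA_PIP}"
  else
    echo "pip not found; did the conda/Jupyter init action run first?" >&2
    return 1
  fi
}

# Install or upgrade the given packages with the resolved pip.
pip_install() {
  local pip_bin
  pip_bin="$(resolve_pip)" || return 1
  "${pip_bin}" install --upgrade "$@"
}
```

With these helpers defined, the script would end with a call such as pip_install google-cloud-bigquery. Returning a non-zero status when pip is missing makes the cause easier to spot in the init action log than a bare "command not found".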

Upvotes: 0
