Reputation: 179
I am trying to set up a Dataproc cluster with the Jupyter optional component.
gcloud beta dataproc clusters create cluster-1ea3 --enable-component-gateway \
--region europe-west1 --subnet data-network --no-address --zone europe-west1-b \
--single-node --master-machine-type n1-standard-4 --master-boot-disk-size 500 \
--image-version 1.5-debian10 --optional-components ANACONDA,JUPYTER \
--scopes 'https://www.googleapis.com/auth/cloud-platform' --project clouddemoenvironment
"--no-address" ensures the cluster gets only a private IP, and the network "data-network" has Private Google Access enabled. Everything works fine if I do not install the Jupyter optional component, but with optional components the cluster fails to come up with the error below.
<13>Nov 5 09:01:44 google-dataproc-startup[1466]: <13>Nov 5 09:01:44 activate-component-jupyter[2710]: Looking in links: /opt/dataproc/jupyter/gcp
<13>Nov 5 09:01:44 google-dataproc-startup[1466]: <13>Nov 5 09:01:44 activate-component-jupyter[2710]: Collecting https://github.com/GoogleCloudPlatform/jupyter-extensions/archive/2cb9d24fe01cd329a8c4352a07b0eb8f9771fb07.zip#subdirectory=jupyter-gcs-contents-manager (from -r /opt/dataproc/jupyter/jupyter_extra_packages.requirements (line 1))
<13>Nov 5 09:01:59 google-dataproc-startup[1466]: <13>Nov 5 09:01:59 activate-component-jupyter[2710]: WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f6b1afbac10>, 'Connection to github.com timed out. (connect timeout=15)')': /GoogleCloudPlatform/jupyter-extensions/archive/2cb9d24fe01cd329a8c4352a07b0eb8f9771fb07.zip
I understand that the cluster has no access to GitHub, so it makes sense that this fails. The documentation says:
If you create a Dataproc cluster with internal IP addresses only, attempts to access github.com over the Internet in an initialization action will fail unless you have configured routes to direct the traffic through Cloud NAT or a Cloud VPN. Without access to the Internet, you can enable Private Google Access, and place job dependencies in Cloud Storage; cluster nodes can download the dependencies from Cloud Storage from internal IPs.
I don't want to use Cloud NAT or Cloud VPN. Is there something I can tell the system so that it resolves the dependency in a different way? Unfortunately, an initialization script probably won't work either, because initialization actions run after the optional components are activated.
Any suggestions on how I can use optional components in an environment without internet access?
Regards, Jill
Upvotes: 2
Views: 471
Reputation: 381
This startup time dependency is a bug in the latest Dataproc images.
It should be fixed with the next Dataproc subminor image version release.
To work around this for now, you can use the previous subminor image version (--image-version=1.5.18-debian10).
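As a rough sketch of that workaround, here is the command from the question with only the image version changed to the pinned subminor release. The `run` helper and the `DRY_RUN` switch are hypothetical conveniences added for illustration (they are not part of gcloud). By default the script only prints the command, so it can be checked before anything is created.

```shell
# Pinned subminor image version suggested as the workaround.
IMAGE_VERSION="1.5.18-debian10"

# Hypothetical helper: print the command unless DRY_RUN=0 is set,
# in which case the command is actually executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Same flags as in the question; only --image-version differs.
run gcloud beta dataproc clusters create cluster-1ea3 --enable-component-gateway \
  --region europe-west1 --subnet data-network --no-address --zone europe-west1-b \
  --single-node --master-machine-type n1-standard-4 --master-boot-disk-size 500 \
  --image-version "$IMAGE_VERSION" --optional-components ANACONDA,JUPYTER \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' --project clouddemoenvironment
```

Running it with `DRY_RUN=0` executes the command instead of printing it. Pinning should only be needed for images affected by the bug.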
UPDATE: this problem has been fixed in the Nov 9 2020 release, so you can just use the latest version.
Upvotes: 2