Reputation: 348
I am trying to deploy a Dataflow job on a GCP VM that will have access to GCP resources but will not have internet access. When I try to run the job I get a connection timeout error, which would make sense if I were trying to reach the internet. The code breaks because an HTTP connection to PyPI is being attempted on behalf of apache-beam.
Python setup: Before cutting the VM off from the internet, I installed all necessary packages using pip and a requirements.txt. This seems to have worked, because other parts of the code run fine.
The following is the error message I receive when I run the code.
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None))
after connection broken by 'ConnectTimeoutError(
<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at foo>,
'Connection to pypi.org timed out. (connect timeout=15)')': /simple/apache-beam/
Could not find a version that satisfies the requirement apache-beam==2.9.0 (from versions: )
No matching distribution found for apache-beam==2.9.0
If you are running a Python job, does it need to connect to PyPI? Is there a hack around this?
Upvotes: 1
Views: 1445
Reputation: 41
TL;DR: Copy the Apache Beam SDK archive into an accessible path and provide that path as the sdk_location SetupOption in your Dataflow pipeline.
I was also struggling with this setup for a long time. I finally found a solution that does not need internet access during execution.
There are probably multiple ways to do that, but the following two are rather simple.
As a precondition you'll need to create the apache-beam-sdk source archive as follows (a condensed shell sketch follows the list):
Clone Apache Beam GitHub
Switch to the required tag, e.g. v2.28.0
cd to beam/sdks/python
Create a tar.gz source archive of your required beam_sdk version as follows:
python setup.py sdist
Now you should have the source archive apache-beam-2.28.0.tar.gz
in the path beam/sdks/python/dist/
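For convenience, the same steps condensed into a shell sketch (tag and paths taken from this example; assumes git and Python are installed):
git clone https://github.com/apache/beam.git
cd beam
git checkout v2.28.0
cd sdks/python
python setup.py sdist
# the archive is now at dist/apache-beam-2.28.0.tar.gz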
Option 1 - Use Flex Templates and copy the Apache Beam SDK in the Dockerfile
Documentation : Google Dataflow Documentation
Copy the Apache Beam SDK source archive to a path that will be accessible from the Dockerfile, e.g. COPY utils/apache-beam-2.28.0.tar.gz /tmp, because this is going to be the path you can set in your SetupOptions.
Dockerfile:
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
# update used packages
RUN apt-get update && apt-get install -y \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
COPY setup.py .
COPY main.py .
COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
RUN python -m pip install --user --upgrade pip setuptools wheel
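The Dockerfile above expects a setup.py and main.py next to it; neither is shown in this answer, so the following is only a hypothetical minimal setup.py sketch (name, version, and dependencies are placeholders):
import setuptools

setuptools.setup(
    name='my-dataflow-pipeline',   # placeholder package name
    version='0.0.1',
    packages=setuptools.find_packages(),
    install_requires=[],           # add your pipeline's extra dependencies here
)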
Set the sdk_location in your pipeline options to the path you copied the archive to:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
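For illustration, a minimal sketch of how this could sit inside the template's main.py (the argument handling and the pipeline itself are hypothetical; only the sdk_location line is from this answer):
import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

def run(argv=None):
    parser = argparse.ArgumentParser()
    _, pipeline_args = parser.parse_known_args(argv)
    options = PipelineOptions(pipeline_args)
    # Point the workers at the SDK archive baked into the image instead of PyPI
    options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
    with beam.Pipeline(options=options) as p:
        p | beam.Create(['no', 'internet', 'needed']) | beam.Map(print)

if __name__ == '__main__':
    run()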
Build the Docker image with Cloud Build:
gcloud builds submit --tag $TEMPLATE_IMAGE .
Create the Flex Template:
gcloud dataflow flex-template build "gs://define-path-to-your-templates/your-flex-template-name.json" \
--image=gcr.io/your-project-id/image-name:tag \
--sdk-language=PYTHON \
--metadata-file=metadata.json
Run the Flex Template in your subnetwork without public IPs:
gcloud dataflow flex-template run "your-dataflow-job-name" \
--template-file-gcs-location="gs://define-path-to-your-templates/your-flex-template-name.json" \
--parameters staging_location="gs://your-bucket-path/staging/" \
--parameters temp_location="gs://your-bucket-path/temp/" \
--service-account-email="your-restricted-sa-dataflow@your-project-id.iam.gserviceaccount.com" \
--region="yourRegion" \
--max-workers=6 \
--subnetwork="https://www.googleapis.com/compute/v1/projects/your-project-id/regions/your-region/subnetworks/your-subnetwork" \
--disable-public-ips
Option 2 - Copy sdk_location from GCS
According to the Beam documentation, you should even be able to provide a GCS gs:// path directly as the sdk_location option, but that didn't work for me. The following workaround does:
Upload the source archive to a bucket, e.g. gs://yourbucketname/beam_sdks/apache-beam-2.28.0.tar.gz, and download it to a local path such as /tmp/apache-beam-2.28.0.tar.gz before constructing the pipeline:
# see: https://cloud.google.com/storage/docs/samples/storage-download-file
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name" (bucket name only, no gs:// prefix)
    # source_blob_name = "path/to/storage-object-name" (path inside the bucket)
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Construct a client-side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

download_blob("your-bucket-name",
              "beam_sdks/apache-beam-2.28.0.tar.gz",
              "/tmp/apache-beam-2.28.0.tar.gz")
Then set the sdk_location to the downloaded archive, as in Option 1:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Upvotes: 3
Reputation: 302
When we use Google Cloud Composer with Private IP enabled, we don't have access to the internet.
To solve this:
Upvotes: 0
Reputation: 236
If you run a DataflowPythonOperator in a private Cloud Composer environment, the job needs internet access to download a set of packages from the image projects/dataflow-service-producer-prod. But within the private cluster, VMs and GKE nodes don't have access to the internet.
To solve this problem, you need to create a Cloud NAT and a router: https://cloud.google.com/nat/docs/gke-example#step_6_create_a_nat_configuration_using
This allows your instances to send packets to the internet and receive the responses to those connections.
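For reference, a minimal sketch of the commands from that guide (router and NAT config names, network, and region are placeholders):
gcloud compute routers create nat-router \
    --network=your-network \
    --region=your-region
gcloud compute routers nats create nat-config \
    --router=nat-router \
    --region=your-region \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges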
Upvotes: 0