Reputation: 106
Apache Beam 2.1.0 added support for submitting jobs to the Dataflow runner on private subnetworks and without public IPs, which we need to satisfy our firewall rules. I planned to use a Squid proxy to reach apt-get, pip, etc. to install Python dependencies; a proxy instance is already running, and we set the proxies inside our setup.py script.
python $DIR/submit.py \
--runner DataflowRunner \
--no_use_public_ips \
--subnetwork regions/us-central1/subnetworks/$PRIVATESUBNET \
--staging_location $BUCKET/staging \
--temp_location $BUCKET/temp \
--project $PROJECT \
--setup_file $DIR/setup.py \
--job_name $JOB_NAME
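The proxy configuration mentioned above amounts to exporting the standard proxy variables before pip or apt-get run (inside setup.py the equivalent is setting them via os.environ). A minimal sketch; the Squid address below is hypothetical, so substitute your own instance:

```shell
#!/bin/sh
# Sketch: proxy variables that pip honors (apt-get typically does as well).
# The address is hypothetical; point it at your Squid instance.
PROXY=http://10.128.0.2:3128

export http_proxy=$PROXY
export https_proxy=$PROXY
```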
When I try to run via the Python API, the job errors out during worker startup, before I get a chance to enable the proxy. It looks to me like each worker first tries to install the Dataflow SDK, and during that step it tries to update requests and pip fails to connect.
None of my code has been executed at this point, so I can't see a way to avoid this error before setting up the proxy. Is there any way to launch dataflow python workers on a private subnet?
Upvotes: 4
Views: 2475
Reputation: 898
This can now also be done using Cloud NAT, which looks like this ($REGION_ID is any GCP region, e.g. us-central1):
gcloud compute routers create nat-router \
--network=$NETWORK_NAME \
--region=$REGION_ID
gcloud compute routers nats create nat-config \
--router=nat-router \
--nat-custom-subnet-ip-ranges=$SUBNET \
--auto-allocate-nat-external-ips \
--region=$REGION_ID
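The two steps above can be wrapped in a small script. This sketch only echoes the commands as a dry run so it is safe to inspect first; delete the echo prefixes to execute for real. The network, region, and subnet names are placeholders:

```shell
#!/bin/sh
# Dry-run sketch of the Cloud NAT setup; names below are placeholders.
NETWORK_NAME=my-network
REGION_ID=us-central1
SUBNET=my-subnet

# Echoed rather than executed; delete "echo" to run for real.
echo gcloud compute routers create nat-router \
  --network=$NETWORK_NAME \
  --region=$REGION_ID

echo gcloud compute routers nats create nat-config \
  --router=nat-router \
  --nat-custom-subnet-ip-ranges=$SUBNET \
  --auto-allocate-nat-external-ips \
  --region=$REGION_ID
```

Afterwards, `gcloud compute routers nats describe nat-config --router=nat-router --region=$REGION_ID` shows the resulting NAT configuration.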
If you need to assign a static IP address to Cloud NAT (perhaps to whitelist the NAT IP address in a firewall rule), you can do that as well:
gcloud compute addresses create nat-ip-address --network=$NETWORK_NAME
# nat-ip-address is the reserved address created above
gcloud compute routers nats create nat-config \
--router=nat-router \
--nat-custom-subnet-ip-ranges=$SUBNET \
--nat-external-ip-pool=nat-ip-address \
--region=$REGION_ID
Resources: Creating Cloud NAT instance
Upvotes: 0
Reputation: 106
I managed to solve this with a NAT gateway instead of a proxy. Following the instructions under Special Configurations, I had to edit one of the steps so that Dataflow worker instances are automatically routed through the gateway:
gcloud compute routes create no-ip-internet-route --network my-network \
--destination-range 0.0.0.0/0 \
--next-hop-instance nat-gateway \
--next-hop-instance-zone us-central1-a \
--tags dataflow --priority 800
I used the tag dataflow, which is the network tag applied to all Dataflow workers, instead of no-ip.
The NAT gateway seems like an easier solution than a proxy in this case, since it will route the traffic without having to configure the workers.
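To sanity-check the result, you can inspect the route and list the instances carrying the tag. Another dry-run sketch (commands are echoed, not executed; delete the echo prefixes to run them):

```shell
#!/bin/sh
# Dry-run sketch: verify the route and the tagged Dataflow workers.
ROUTE_NAME=no-ip-internet-route
TAG=dataflow

# Show the route definition, including its priority and tags.
echo gcloud compute routes describe $ROUTE_NAME
# Dataflow workers carry the "dataflow" network tag, so this lists them.
echo gcloud compute instances list --filter=tags.items=$TAG
```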
Upvotes: 4