baxen

Reputation: 106

How to run Dataflow python on a private subnetwork?

Apache Beam 2.1.0 added support for submitting jobs to the Dataflow runner on private subnetworks and without public IPs, which we needed to satisfy our firewall rules. I planned to use a squid proxy to access apt-get, pip, etc., to install Python dependencies; a proxy instance is already running, and we set the proxies inside our setup.py script.
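The proxy setup inside setup.py can be as simple as exporting the standard proxy environment variables before any install commands run. A minimal sketch (the proxy address is a placeholder for your squid instance):

```python
import os

# Placeholder: replace with your squid proxy's internal address and port.
PROXY = "http://10.128.0.2:3128"

def set_proxy_env():
    """Export proxy variables so pip/apt-get on the worker go through squid."""
    for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
        os.environ[var] = PROXY

# setup.py would call set_proxy_env() before running any custom install
# commands (e.g. subprocess calls to pip or apt-get).
```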

python $DIR/submit.py \
       --runner DataflowRunner \
       --no_use_public_ips \
       --subnetwork regions/us-central1/subnetworks/$PRIVATESUBNET \
       --staging_location $BUCKET/staging \
       --temp_location $BUCKET/temp \
       --project $PROJECT \
       --setup_file $DIR/setup.py \
       --job_name $JOB_NAME
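The same flags can also be assembled programmatically inside submit.py; a sketch, where the parameter names mirror the shell placeholders above and the resulting list would be handed to apache_beam's PipelineOptions:

```python
def build_pipeline_flags(project, bucket, subnet, job_name):
    """Return the flag list equivalent to the shell invocation above."""
    return [
        "--runner=DataflowRunner",
        "--no_use_public_ips",
        f"--subnetwork=regions/us-central1/subnetworks/{subnet}",
        f"--staging_location={bucket}/staging",
        f"--temp_location={bucket}/temp",
        f"--project={project}",
        "--setup_file=./setup.py",
        f"--job_name={job_name}",
    ]

# In submit.py this would typically become:
#   options = PipelineOptions(build_pipeline_flags(...))
```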

When I try to run via the Python API, the job errors out during worker startup, before I get a chance to enable the proxy. It looks to me like each worker first tries to install the Dataflow SDK:

install_dataflow_sdk

and during that step it tries to update requests and fails to connect to pip:

(screenshot: worker-startup log showing pip unable to connect while updating the requests package)

None of my code has been executed at this point, so I can't see a way to avoid this error before setting up the proxy. Is there any way to launch dataflow python workers on a private subnet?

Upvotes: 4

Views: 2475

Answers (2)

Nick

Reputation: 898

This can now also be done using Cloud NAT, which looks like this:

($REGION_ID is any GCP region, e.g. us-central1)

gcloud compute routers create nat-router \
       --network=$NETWORK_NAME \
       --region=$REGION_ID

gcloud compute routers nats create nat-config \
   --router=nat-router \
   --nat-custom-subnet-ip-ranges=$SUBNET \
   --auto-allocate-nat-external-ips \
   --region=$REGION_ID

If you need to assign a static IP address to Cloud NAT (to, perhaps, whitelist the NAT IP address in a firewall rule) you can do that as well:

gcloud compute addresses create nat-ip-address --network=$NETWORK_NAME

# nat-ip-address is the static address created above
gcloud compute routers nats create nat-config \
   --router=nat-router \
   --nat-custom-subnet-ip-ranges=$SUBNET \
   --nat-external-ip-pool=nat-ip-address \
   --region=$REGION_ID

Resources: Creating Cloud NAT instance

Upvotes: 0

baxen

Reputation: 106

I managed to solve this with a NAT gateway instead of a proxy. Following the instructions under special configurations, I had to edit one of the steps so that Dataflow worker instances are automatically routed through the gateway:

gcloud compute routes create no-ip-internet-route --network my-network \
    --destination-range 0.0.0.0/0 \
    --next-hop-instance nat-gateway \
    --next-hop-instance-zone us-central1-a \
    --tags dataflow --priority 800

I used the tag dataflow instead of no-ip, which is the network tag for all Dataflow workers.

The NAT gateway seems like an easier solution than a proxy in this case, since it will route the traffic without having to configure the workers.

Upvotes: 4
