nonoDa

Reputation: 453

DNS mapping for GCP Dataflow Kafka to BigQuery

Hello, I'm trying to run a Dataflow job that reads some events from a Kafka cluster hosted outside GCP; the job runs on a VPC network.

The problem is that Kafka is configured to answer with hostnames instead of IPs, so specifying IPs in the bootstrap servers results in a failure to connect to the target node when running a job on Dataflow:

Reader-1: Timeout while initializing partition 'placeholder'. Kafka client may not be able to connect to servers. 

On the other hand, if I create a VM with Kafka and add the hostname-to-IP mapping in /etc/hosts, I am able to consume correctly.
To make it work on Dataflow, I tried to create a private Cloud DNS zone with DNS name = . so that I have a zone where I can map the hosts, one record per node with DNS name = nodename1. and data = IP1 (roughly as sketched below).
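In gcloud terms this was something like the following; the zone name, network, and node IP here are placeholders, and on older gcloud versions record creation goes through `gcloud dns record-sets transaction` instead of `record-sets create`:

```bash
# Catch-all private zone rooted at ".", attached to the VPC the job runs on
# (this is the approach that later turned out to be wrong).
gcloud dns managed-zones create kafka-root \
    --description="Catch-all zone for Kafka hostnames" \
    --dns-name="." \
    --visibility=private \
    --networks=my-vpc-network

# One A record per Kafka node, e.g. nodename1 -> IP1.
gcloud dns record-sets create nodename1. \
    --zone=kafka-root \
    --type=A \
    --ttl=300 \
    --rrdatas=203.0.113.10
```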

This seems to work: I am able to telnet to nodename1 from a VM where I did not add the mapping to /etc/hosts.
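The check from that VM was roughly the following (9092 is just the default Kafka broker port, yours may differ):

```bash
# Verify the name resolves via the private zone (no /etc/hosts entry on this VM)...
nslookup nodename1

# ...and that the broker is reachable on its port.
telnet nodename1 9092
```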

However, the job gets stuck at the beginning and the only error I get is:

"Timeout in polling result file: gs://placeholder/staging,zone=europe-central2-a/template_launches/2021-09-28_01_43_38-12780367774525067695/operation_result.
Possible causes are:
1. Your launch takes too long time to finish. Please check the logs on stackdriver.
2. Service account placeholder@placeholder may not have enough permissions to pull container image gcr.io/dataflow-templates/2021-09-20-00_rc00/kafka-to-bigquery or create new objects in gs://placeholder.
3. Transient errors occurred, please try again."

Is there an easy way to map hostnames to IPs for a Dataflow job?

Upvotes: 0

Views: 365

Answers (1)

nonoDa

Reputation: 453

I figured this one out.

The correct way was not to create a zone with DNS name = . , since this means that, to resolve any domain, GCP looks into this zone and finds nothing other than the name-to-IP entries I created. Dataflow workers internally query 169.254.169.254; those lookups were forwarded to the zone I created and could not be resolved, so the job hung.

The correct way was to create a zone for each Kafka nodename, with DNS name = nodename. , and then map an entry from nodename. to its IP (see the sketch below).
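In gcloud terms, a minimal sketch for one node (zone name, network, and IP are placeholders; `record-sets create` requires a reasonably recent gcloud, otherwise use the `record-sets transaction` workflow):

```bash
# One private zone per Kafka node, scoped to just that hostname,
# so every other lookup still falls through to normal GCP resolution.
gcloud dns managed-zones create kafka-node1 \
    --description="Private zone for Kafka node nodename1" \
    --dns-name="nodename1." \
    --visibility=private \
    --networks=my-vpc-network

# Map the zone apex to the node's IP.
gcloud dns record-sets create nodename1. \
    --zone=kafka-node1 \
    --type=A \
    --ttl=300 \
    --rrdatas=203.0.113.10
```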

By repeating this for every bootstrapped nodename that Dataflow was not able to resolve automatically, I was able to consume from the Kafka topic correctly.
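Once every nodename resolves, the job can be launched with hostnames in the bootstrap list. A hedged sketch of the launch: the flex-template path and the parameter names (bootstrapServers, inputTopics, outputTableSpec) should be double-checked against the template's documentation, and all other values are placeholders:

```bash
gcloud dataflow flex-template run kafka-to-bq \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/Kafka_to_BigQuery \
    --region=europe-central2 \
    --network=my-vpc-network \
    --parameters=bootstrapServers=nodename1:9092,inputTopics=my-topic,outputTableSpec=my-project:my_dataset.my_table
```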

Upvotes: 1
