Reputation: 125
I tried creating a Dataproc cluster both through Airflow and through the Google Cloud UI, and the cluster creation always fails at the end. The following is the Airflow code I am using to create the cluster:
# STEP 1: Libraries needed
from datetime import timedelta, datetime
from airflow import models
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule
from poc.utils.transform import main
from airflow.contrib.hooks.gcp_dataproc_hook import DataProcHook
from airflow.operators.python_operator import BranchPythonOperator
import os

YESTERDAY = datetime.combine(
    datetime.today() - timedelta(1),
    datetime.min.time())

project_name = os.environ['GCP_PROJECT']

# Can pull in Spark code from a GCS bucket
# SPARK_CODE = ('gs://us-central1-cl-composer-tes-fa29d311-bucket/spark_files/transformation.py')
dataproc_job_name = 'spark_job_dataproc'

default_dag_args = {
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'start_date': YESTERDAY,
    'retry_delay': timedelta(minutes=5),
    'project_id': project_name,
    'owner': 'DataProc',
}

with models.DAG(
        'dataproc-poc',
        description='Dag to run a simple dataproc job',
        schedule_interval=timedelta(days=1),
        default_args=default_dag_args) as dag:

    CLUSTER_NAME = 'dataproc-cluster'

    def ensure_cluster_exists(ds, **kwargs):
        cluster = DataProcHook().get_conn().projects().regions().clusters().get(
            projectId=project_name,
            region='us-east1',
            clusterName=CLUSTER_NAME
        ).execute(num_retries=5)
        print(cluster)
        if cluster is None or len(cluster) == 0 or 'clusterName' not in cluster:
            return 'create_dataproc'
        else:
            return 'run_spark'

    # start = BranchPythonOperator(
    #     task_id='start',
    #     provide_context=True,
    #     python_callable=ensure_cluster_exists,
    # )

    print_date = BashOperator(
        task_id='print_date',
        bash_command='date'
    )

    # Create the Dataproc cluster
    create_dataproc = dataproc_operator.DataprocClusterCreateOperator(
        task_id='create_dataproc',
        cluster_name=CLUSTER_NAME,
        num_workers=2,
        use_if_exists='true',
        zone='us-east1-b',
        master_machine_type='n1-standard-1',
        worker_machine_type='n1-standard-1')

    # Run the PySpark job
    run_spark = dataproc_operator.DataProcPySparkOperator(
        task_id='run_spark',
        main=main,
        cluster_name=CLUSTER_NAME,
        job_name=dataproc_job_name
    )

    # Delete the Cloud Dataproc cluster.
    # delete_dataproc = dataproc_operator.DataprocClusterDeleteOperator(
    #     task_id='delete_dataproc',
    #     cluster_name='dataproc-cluster-demo-{{ ds_nodash }}',
    #     trigger_rule=trigger_rule.TriggerRule.ALL_DONE)

    # STEP 6: Set DAG dependencies
    # Each task should run after the previous task has finished.
    print_date >> create_dataproc >> run_spark
    # print_date >> start >> create_dataproc >> run_spark
    # start >> run_spark
I checked the cluster logs and saw the following errors:
Upvotes: 3
Views: 3143
Reputation: 36
Cannot start master: Timed out waiting for 2 datanodes and nodemanagers.
Operation timed out: Only 0 out of 2 minimum required datanodes running.
Operation timed out: Only 0 out of 2 minimum required node managers running.
This error suggests that the worker nodes are not able to communicate with the master node. When the worker nodes cannot report to the master node within the allotted time, cluster creation fails.
Please check whether you have set up firewall rules that allow the cluster VMs to communicate with each other; a sketch of one way to do this follows below.
You can refer to the following network configuration best practices: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network#overview
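As a minimal sketch (not taken from the question), the snippet below uses the Compute Engine API through google-api-python-client to add a firewall rule permitting internal TCP, UDP, and ICMP traffic between VMs on the default network. The rule name allow-dataproc-internal and the 10.128.0.0/9 source range are assumptions that match a default auto-mode VPC; adjust them to your own project and network.

# Sketch: allow internal traffic between cluster VMs on the default network.
# Assumptions: default auto-mode VPC (10.128.0.0/9), application default
# credentials, and an arbitrary rule name 'allow-dataproc-internal'.
from googleapiclient import discovery

def allow_internal_traffic(project_id):
    compute = discovery.build('compute', 'v1')
    firewall_body = {
        'name': 'allow-dataproc-internal',
        'network': 'global/networks/default',
        'sourceRanges': ['10.128.0.0/9'],
        'allowed': [
            {'IPProtocol': 'tcp', 'ports': ['0-65535']},
            {'IPProtocol': 'udp', 'ports': ['0-65535']},
            {'IPProtocol': 'icmp'},
        ],
    }
    # Insert the rule; poll the returned operation if you need to wait
    # until it becomes active before creating the cluster.
    return compute.firewalls().insert(project=project_id, body=firewall_body).execute()

If the cluster should instead run on a custom VPC that already has such a rule, the create operator can be pointed at that network (for example via its network_uri/subnetwork_uri parameters) so the nodes come up where internal traffic is already allowed.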
Upvotes: 1