Reputation: 21
I have been looking for cluster configs in JSON format to create a Dataproc cluster (on GCE) with a Dataproc Metastore service and the Spark-BigQuery connector jars, but I am unable to find any reference document that shows how to use those JSON configs.
I have looked through the links below:
https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/contrib/operators/dataproc_operator/index.html https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters https://cloud.google.com/dataproc/docs/reference/rest/v1/MetastoreConfig
but they cover the REST API and GKE cluster configs rather than GCE cluster configs. Please see below the config I am trying out to create the Dataproc cluster:
CLUSTER_CONFIG = {
    "gce_cluster_config": {
        "internal_ip_only": True,
        "metadata": {
            "spark-bigquery-connector-version": spark_bq_connector_version
        },
        "service_account_scopes": [
            service_account_scopes
        ],
        "subnetwork_uri": subnetwork_uri,
        "zone_uri": zone_uri
    },
    "initialization_actions": [
        {
            "executable_file": initialization_actions,
            "execution_timeout": execution_timeout
        }
    ],
    "master_config": {
        "disk_config": {
            "boot_disk_size_gb": master_boot_disk_size_gb
        },
        "machine_type_uri": master_machine_type_uri
    },
    "metastore_config": {
        "dataproc_metastore_service": dataproc_metastore
    },
    "software_config": {
        "image_version": software_image_version
    },
    "worker_config": {
        "disk_config": {
            "boot_disk_size_gb": worker_boot_disk_size_gb
        },
        "machine_type_uri": worker_machine_type_uri,
        "num_instances": worker_num_instances
    }
}
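For context, the dataproc_metastore value above is built as the full resource name of the metastore service, following the MetastoreConfig REST reference linked earlier. A minimal sketch (the project, region, and service names here are placeholders):

```python
# Placeholder values; the real project/region/service come from my environment.
project_id = "my-project"
region = "us-central1"
metastore_service_id = "my-metastore"

# Full resource name expected by metastore_config.dataproc_metastore_service,
# per the MetastoreConfig REST reference linked above.
dataproc_metastore = (
    f"projects/{project_id}/locations/{region}/services/{metastore_service_id}"
)
```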
Any lead would be really appreciated; please attach links to full config examples.
Thanks!
Upvotes: 2
Views: 795
Reputation: 26488
As mentioned in this doc, an external Hive metastore (i.e., not a Dataproc Metastore service) needs to be specified through the hive:hive.metastore.uris property. Note the hive: prefix.
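In the snake_case dict form used with the Python client or the Airflow operators, that property would presumably sit under software_config.properties; a sketch (the thrift host and port are placeholders):

```python
# Sketch: where the hive:hive.metastore.uris property goes in a snake_case
# cluster config. The thrift://my-metastore:9083 URI below is a placeholder.
software_config = {
    "image_version": "1.5",
    "properties": {
        "hive:hive.metastore.uris": "thrift://my-metastore:9083",
    },
}
```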
When creating the cluster with gcloud, if you add --log-http:
$ gcloud dataproc clusters create ... \
--properties hive:hive.metastore.uris=thrift://my-metastore:9083 \
--log-http
it will show you the actual HTTP request:
{
  "clusterName": "...",
  "config": {
    "endpointConfig": {
      "enableHttpPortAccess": true
    },
    "gceClusterConfig": {
      "internalIpOnly": false,
      "serviceAccountScopes": [
        "https://www.googleapis.com/auth/cloud-platform"
      ],
      "zoneUri": "us-west1-a"
    },
    "masterConfig": {
      "diskConfig": {},
      "machineTypeUri": "e2-standard-2"
    },
    "softwareConfig": {
      "imageVersion": "1.5",
      "properties": {
        "hive:hive.metastore.uris": "thrift://my-metastore:9083"
      }
    },
    "workerConfig": {
      "diskConfig": {},
      "machineTypeUri": "e2-standard-2"
    }
  },
  "projectId": "..."
}
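Note that the HTTP payload uses camelCase field names, while the snake_case dict you wrote maps onto it mechanically. A naive sketch of that key conversion (it also recurses into map values like properties and metadata, which happens to be harmless here because those keys contain no underscores):

```python
def to_camel(key: str) -> str:
    """Convert a snake_case key to camelCase, e.g. zone_uri -> zoneUri."""
    first, *rest = key.split("_")
    return first + "".join(word.capitalize() for word in rest)

def snake_to_camel(obj):
    """Recursively rename dict keys from snake_case to camelCase.

    Naive sketch: it also visits keys inside 'properties'/'metadata' maps,
    which is harmless here only because those keys have no underscores.
    """
    if isinstance(obj, dict):
        return {to_camel(k): snake_to_camel(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [snake_to_camel(v) for v in obj]
    return obj
```

For example, snake_to_camel({"gce_cluster_config": {"internal_ip_only": True}}) yields {"gceClusterConfig": {"internalIpOnly": True}}, matching the request shape above.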
You can also find the request spec in the Dataproc REST API doc.
Upvotes: 0