deepNdope

Reputation: 319

Logging Attached Cluster Information in Databricks / Spark

I would like to do some performance testing on Databricks. To do this, I would like to log which cluster (VM type, e.g. Standard_DS3_v2) I was using during the test (we can assume that the driver and worker nodes are the same). I know I could log the number of workers, the number of cores (on the driver at least) and the memory (on the driver at least). However, I would like to know the VM type, since I want to be able to identify whether I used e.g. a storage-optimized or general-purpose cluster. Instead of the VM type, equivalent information would also be fine.

Optimally, I can get this information as a string in a variable within the notebook, so I can later write it into my log file together with the other information I am logging. However, I am also happy with any hacky workaround if there is no straightforward solution.

Upvotes: 2

Views: 2177

Answers (1)

Alex Ott

Reputation: 87174

You can get this information from the REST API, via a GET request to the Clusters API. You can use the notebook context to identify the cluster the notebook is running on: the dbutils.notebook.getContext call returns a map of attributes, including the cluster ID and the workspace domain name, and you can also extract an authentication token from it. Here is code that prints the driver & worker node types (it's in Python, but the Scala version is similar - I often use Scala's dbutils.notebook.getContext.tags to find what tags are available):

import requests

# Grab the notebook context, which carries tags such as the cluster ID
# and the workspace host name.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
# A personal access token; alternatively it can be extracted from the
# context via ctx.apiToken().get()
host_token = "your_PAT_token"
cluster_id = ctx.tags().get("clusterId").get()

# Call the Clusters API to fetch the cluster definition
response = requests.get(
    f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
    headers={'Authorization': f'Bearer {host_token}'}
).json()
print(f"driver type={response['driver_node_type_id']} worker type={response['node_type_id']}")

Upvotes: 2
