Rion Williams

Reputation: 76547

Missing Metrics for Apache Beam Pipeline (via SparkRunner / Dataproc)

I'm currently in the process of adding some metrics to an existing pipeline that runs on Google Dataproc via the Spark Runner, and I'm trying to determine how to access these metrics and eventually expose them to Stackdriver (to be used downstream in Grafana dashboards).

The metrics themselves are fairly simple (a series of counters) and are defined as follows (and accessed in DoFns throughout the pipeline):

import org.apache.beam.sdk.metrics.Counter
import org.apache.beam.sdk.metrics.Metrics as BeamMetrics

// Counters are declared once here and incremented from DoFns across the pipeline.
object Metrics {
   val exampleMetric: Counter = BeamMetrics.counter(ExamplePipeline::class.qualifiedName, "count")

   // Others omitted for brevity
}

This metric (and the others) are incremented throughout the pipeline in various DoFn calls, and several unit tests confirm that the pipeline's MetricQueryResults object is properly populated after executing via the DirectRunner.
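For context, the counters are bumped inside DoFns and then checked in the unit tests via the PipelineResult, roughly as in the sketch below (the DoFn, the helper function, and the pipeline handle here are illustrative, not lifted from the actual codebase):

import org.apache.beam.sdk.PipelineResult
import org.apache.beam.sdk.metrics.MetricNameFilter
import org.apache.beam.sdk.metrics.MetricQueryResults
import org.apache.beam.sdk.metrics.MetricsFilter
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessContext
import org.apache.beam.sdk.transforms.DoFn.ProcessElement

class ExampleDoFn : DoFn<String, String>() {
   @ProcessElement
   fun processElement(context: ProcessContext) {
       // Bump the shared counter for every element processed
       Metrics.exampleMetric.inc()
       context.output(context.element())
   }
}

// In the DirectRunner-based unit tests, the counters are then queried off the PipelineResult:
fun verifyCounters(result: PipelineResult) {
   result.waitUntilFinish()
   val queryResults: MetricQueryResults = result.metrics().queryMetrics(
       MetricsFilter.builder()
           .addNameFilter(MetricNameFilter.named(ExamplePipeline::class.qualifiedName, "count"))
           .build()
   )
   queryResults.counters.forEach { println("${it.name}: ${it.committed}") }
}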

The primary issue is that I see no indication within Dataproc or any of the related UIs exposed in GCP (YARN ResourceManager, Spark History Server, YARN Application Timeline, etc.) that these metrics are being emitted. I've tried scouring the logs and anywhere else I can think of, but I don't see any sign of these custom metrics (or, really, of any metrics at all being emitted from Spark or into Stackdriver).

Job Configuration

The Spark job itself is configured through the following command in a script (assuming that the appropriate .jar file has been copied up into the proper bucket in GCP):

gcloud dataproc jobs submit spark --jar $bucket/deployment/example-pipeline.jar \
       --project $project_name \
       --cluster $cluster_name \
       --region $region  \
       --id pipeline-$timestamp \
       --driver-log-levels $lots_of_things_here \
       --properties=spark.dynamicAllocation.enabled=false \
       --labels="type"="example-pipeline","namespace"="$namespace" \
       --async \
       -- \
         --runner=SparkRunner \
         --streaming

Cluster Configuration

The cluster itself appears to have every metric-related property I could think of enabled, such as:

dataproc:dataproc.logging.stackdriver.enable=true
dataproc:dataproc.logging.stackdriver.job.driver.enable=true
dataproc:dataproc.monitoring.stackdriver.enable=true
dataproc:spark.submit.deployMode=cluster
spark:spark.eventLog.dir=hdfs:///var/log/spark/apps
spark:spark.eventLog.enabled=true
yarn:yarn.log-aggregation-enable=true
yarn:yarn.log-aggregation.retain-seconds=-1

Those were just a few of the properties on the cluster; there are countless others, so if one appears to be missing or incorrect (as it pertains to the metrics story), feel free to ask.
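For reference, Dataproc properties with these prefixes are normally supplied when the cluster is created via the --properties flag; a rough sketch mirroring a subset of the values above (the cluster name and region are placeholders) would be:

gcloud dataproc clusters create $cluster_name \
       --region $region \
       --properties="dataproc:dataproc.monitoring.stackdriver.enable=true,dataproc:dataproc.logging.stackdriver.enable=true,spark:spark.eventLog.enabled=true,yarn:yarn.log-aggregation-enable=true"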

Questions

I have to imagine this is a fairly common use case (for Spark / Dataproc / Beam), but I'm unsure which pieces of the configuration puzzle are missing, and documentation/articles related to this process seem quite sparse. What do I need to configure, on the cluster, the job, or the pipeline itself, so that these Beam counters actually show up and can ultimately be pushed into Stackdriver?

Thanks in advance!

Upvotes: 2

Views: 341

Answers (1)

Dagang Wei

Reputation: 26478

Unfortunately, as of today, Dataproc doesn't have Stackdriver integration for Spark system metrics or custom metrics.

Spark system metrics can be enabled by configuring /etc/spark/conf/metrics.properties (you can copy from /etc/spark/conf/metrics.properties.template) or through cluster/job properties; see more info in this doc. At best, you can have these metrics available as CSV files or HTTP services in the cluster, but there is no integration with Stackdriver yet.
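For illustration, a minimal /etc/spark/conf/metrics.properties that enables the CSV sink for all instances might look like the following sketch (the period and output directory are arbitrary placeholders):

*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

The "HTTP services" option typically refers to the default MetricsServlet, which serves JSON snapshots at /metrics/json on the driver UI.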

For Spark custom metrics, you might need to implement your own metrics source, like in this question; it can then be exposed in the cluster alongside the system metrics as above.
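A very rough sketch of such a source, assuming a Kotlin codebase like the one in the question (note that org.apache.spark.metrics.source.Source is package-private in some Spark versions, so the class may need to live under an org.apache.spark.* package, and the registration mechanism varies by Spark version):

import com.codahale.metrics.Counter
import com.codahale.metrics.MetricRegistry
import org.apache.spark.metrics.source.Source

class ExamplePipelineSource : Source {
   private val registry = MetricRegistry()

   // Name under which the metrics appear in whatever sinks are configured in metrics.properties
   override fun sourceName(): String = "example_pipeline"

   override fun metricRegistry(): MetricRegistry = registry

   // Hypothetical Dropwizard counter mirroring the Beam "count" metric
   val count: Counter = registry.counter("count")
}

// Registration is version-dependent; shown here purely as an illustration:
// SparkEnv.get().metricsSystem().registerSource(ExamplePipelineSource())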

Upvotes: 1
