ramanKC

Reputation: 317

How to fetch Spark Streaming job statistics using REST calls when running in yarn-cluster mode

I have a Spark Streaming program running on a YARN cluster in "yarn-cluster" mode (--master yarn-cluster). I want to fetch Spark job statistics in JSON format using REST APIs. I am able to fetch basic statistics with the REST URL http://yarn-cluster:8088/proxy/application_1446697245218_0091/metrics/json, but this returns only very basic statistics.

However, I want to fetch per-executor or per-RDD statistics. How do I do that using REST calls, and where can I find the exact REST URLs for these statistics? The $SPARK_HOME/conf/metrics.properties file sheds some light regarding URLs, i.e.

5. MetricsServlet is added by default as a sink in master, worker and client driver, you can send http request "/metrics/json" to get a snapshot of all the registered metrics in json format. For master, requests "/metrics/master/json" and "/metrics/applications/json" can be sent separately to get metrics snapshot of instance master and applications. MetricsServlet may not be configured by self.

but those URLs return HTML pages, not JSON. Only "/metrics/json" returns stats in JSON format. On top of that, knowing the application_id programmatically is a challenge in itself when running in yarn-cluster mode.

I checked the REST API section of the Spark Monitoring page, but that didn't work when running the Spark job in yarn-cluster mode. Any pointers/answers are welcome.
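One way to discover the application_id programmatically is the YARN ResourceManager REST API (/ws/v1/cluster/apps), which can filter by state and application type. The sketch below assumes the ResourceManager runs at yarn-cluster:8088; substitute your own host.

```python
import json
import urllib.request

# ResourceManager address (assumption; replace with your cluster's RM host:port).
RM = "http://yarn-cluster:8088"

def running_spark_apps(rm=RM):
    """Return (id, name) pairs of RUNNING Spark apps via the YARN RM REST API."""
    url = rm + "/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # "apps" is null when no applications match the filter.
    apps = (data.get("apps") or {}).get("app") or []
    return [(a["id"], a["name"]) for a in apps]

def metrics_url(app_id, rm=RM):
    """Build the proxied /metrics/json URL for a given application id."""
    return "{}/proxy/{}/metrics/json".format(rm, app_id)
```

With the id in hand, metrics_url("application_1446697245218_0091") reconstructs the proxied metrics URL from the question.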

Upvotes: 6

Views: 4129

Answers (3)

Emaad Ahmed Manzoor

Reputation: 493

I was able to reconstruct the metrics in the columns seen in the Spark Streaming web UI (batch start time, processing delay, scheduling delay) using the /jobs/ endpoint.

The script I used is available here. I wrote a short post describing it and tying its functionality back to the Spark codebase. No web scraping is needed.

It works for Spark 2.0.0 and YARN 2.7.2, but may work for other version combinations too.
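A minimal sketch of the same idea, assuming the proxied base URL from the question: fetch the /jobs endpoint of the application REST API and derive per-job durations from the submissionTime/completionTime fields it returns (timestamps use Spark's "...GMT"-suffixed format).

```python
import json
import urllib.request
from datetime import datetime

# Proxied REST API base (assumption; substitute your own host and application id).
BASE = "http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1"

def parse_ts(ts):
    """Parse a Spark REST timestamp like '2015-11-05T10:23:45.123GMT'."""
    return datetime.strptime(ts.replace("GMT", ""), "%Y-%m-%dT%H:%M:%S.%f")

def job_durations(app_id, base=BASE):
    """Yield (jobId, seconds) for completed jobs from the /jobs endpoint."""
    url = "{}/applications/{}/jobs".format(base, app_id)
    with urllib.request.urlopen(url) as resp:
        jobs = json.load(resp)
    for j in jobs:
        if "completionTime" in j:  # skip jobs still running
            delta = parse_ts(j["completionTime"]) - parse_ts(j["submissionTime"])
            yield j["jobId"], delta.total_seconds()
```

For a streaming app, grouping these jobs by batch gives the processing-delay figures shown in the Streaming tab.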

Upvotes: 3

Sachin

Reputation: 26

You'll need to scrape the HTML page to get the relevant metrics; there isn't a Spark REST endpoint that exposes this info.

Upvotes: 1

user5728085

Reputation: 71

You should be able to access the Spark REST API using:

http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/

From there you can pick the app-id out of the list and use the following endpoint to get information about, for example, the executors:

http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/{app-id}/executors

I verified this with my Spark Streaming application running in yarn-cluster mode.

I'll explain how I arrived at the JSON response using a web browser. (This applies to a Spark 1.5.2 streaming application in yarn-cluster mode.)

First, use the Hadoop URL to view the RUNNING applications: http://{yarn-cluster}:8088/cluster/apps/RUNNING.

Next, select a running application, say http://{yarn-cluster}:8088/cluster/app/application_1450927949656_0021.

Next, click on the TrackingUrl link. This goes through a proxy, and the port is different in my case: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/. This shows the Spark UI. Now, append api/v1/applications to this URL: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/api/v1/applications.

You should see a JSON response with the application name supplied to SparkConf and the start time of the application.
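The browser steps above can be scripted. A sketch, assuming the proxied application URL from this answer (substitute your own proxy host and application id): list the applications served by the proxy, take the single app's id, then hit its executors endpoint.

```python
import json
import urllib.request

# Proxied base URL of the running application (assumption; use your own ids).
PROXY = "http://yarn-cluster:8088/proxy/application_1446697245218_0091"

def executors_url(proxy, app_id):
    """Build the per-application executors endpoint URL."""
    return "{}/api/v1/applications/{}/executors".format(proxy, app_id)

def executor_summary(proxy=PROXY):
    """Return (id, rddBlocks, memoryUsed) for each executor of the proxied app."""
    with urllib.request.urlopen(proxy + "/api/v1/applications") as resp:
        apps = json.load(resp)
    app_id = apps[0]["id"]  # the YARN proxy serves exactly one application
    with urllib.request.urlopen(executors_url(proxy, app_id)) as resp:
        return [(e["id"], e["rddBlocks"], e["memoryUsed"])
                for e in json.load(resp)]
```

The executors JSON also carries diskUsed, totalTasks and similar per-executor fields, which covers the per-executor statistics asked about in the question.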

Upvotes: 7
