svKris

Reputation: 783

Profiling a Scala Spark application

I would like to profile my Spark Scala applications to figure out which parts of the code I have to optimize. I enabled -Xprof in --driver-java-options, but it is not of much help to me as it gives a lot of very granular detail. I just want to know how much time each function call in my application takes. As in other Stack Overflow questions, many people suggest YourKit, but it is not inexpensive. So I would like to use something that is not costly, ideally free of cost.

Are there any better ways to solve this?

Upvotes: 15

Views: 16685

Answers (5)

apnith

Reputation: 325

I would suggest checking out sparklens. It is a profiling and performance prediction tool for Spark with a built-in Spark scheduler simulator. It gives an overall idea of how efficiently your cluster resources are utilized and what effect (approximately) a change in cluster resource configuration would have on performance.
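
A minimal sketch of attaching it to a run, assuming the Spark Packages coordinates and listener class from the sparklens README (the package version, application class and jar below are placeholders):

    # Sketch: attach sparklens to an ordinary spark-submit run.
    # Package version, application class and jar are placeholders.
    spark-submit \
      --packages qubole:sparklens:0.3.2-s_2.11 \
      --repositories https://repos.spark-packages.org \
      --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
      --class com.example.MyApp \
      --master yarn \
      my-app.jar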

Upvotes: 1

pkumbhar

Reputation: 88

Look at the JVM Profiler released by Uber.

JVM Profiler is a tool developed by Uber for analysing JVM applications in a distributed environment. It attaches a Java agent to the executors of a Spark/Hadoop application in a distributed way and collects various metrics at runtime. It can also trace arbitrary Java methods and arguments without source code changes (similar to DTrace).

Here is the blog post.
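
As a rough sketch of how the agent gets attached: the profiler jar is shipped to the executors and hooked in through the executor JVM options. The jar version, path and reporter options below are placeholders; the agent arguments it actually supports are listed in the jvm-profiler README.

    # Sketch: ship the profiler jar to the executors and attach it as a Java agent.
    # Jar version, HDFS path, reporter and application details are placeholders.
    spark-submit \
      --jars hdfs:///lib/jvm-profiler-1.0.0.jar \
      --conf "spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000" \
      --class com.example.MyApp \
      --master yarn \
      my-app.jar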

Upvotes: 3

Michael Spector

Reputation: 37009

I've recently written an article and a script; the script wraps spark-submit and generates a flame graph after executing your Spark application.

Here's the article: https://www.linkedin.com/pulse/profiling-spark-applications-one-click-michael-spector

Here's the script: https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph

Just use it instead of the regular spark-submit.
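
For example (the application class and jar below are just placeholders):

    # Sketch: fetch the wrapper once, then invoke it with exactly the
    # arguments you would normally pass to spark-submit.
    curl -o spark-submit-flamegraph \
      https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph
    chmod +x spark-submit-flamegraph

    ./spark-submit-flamegraph \
      --class com.example.MyApp \
      --master yarn \
      my-app.jar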

Upvotes: 7

aviemzur

Reputation: 171

As you said, profiling a distributed process is trickier than profiling a single JVM process, but there are ways to achieve this.

You can use sampling as a thread-profiling method: add a Java agent to the executors that captures stack traces, then aggregate those stack traces to see which methods your application spends the most time in.

For example, you can use Etsy's statsd-jvm-profiler Java agent, configure it to send the stack traces to InfluxDB, and then aggregate them into flame graphs.
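
A rough sketch of wiring that up: ship the agent jar with the job and point it at your InfluxDB instance through the executor JVM options. The host, database and the exact agent option names below are placeholders from memory, so verify them against the statsd-jvm-profiler README.

    # Sketch: attach the statsd-jvm-profiler agent to the executors so it
    # samples stack traces and reports them to InfluxDB. Hostname, database
    # and option names are placeholders.
    spark-submit \
      --jars statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar \
      --conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=influxdb.example.com,port=8086,reporter=InfluxDBReporter,database=profiler,prefix=myapp" \
      --class com.example.MyApp \
      --master yarn \
      my-app.jar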

For more information, check out my post on profiling Spark applications: https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/

Upvotes: 8

hveiga

Reputation: 6915

I would recommend using the UI that Spark itself provides. It gives you a lot of information and metrics regarding time, stages, network usage, etc.

You can check more about it here: https://spark.apache.org/docs/latest/monitoring.html

Also, in the newer Spark versions (1.4.0 onwards) there is a nice visualization for understanding the steps and stages of your Spark jobs.
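
If the application has already finished, the same UI can still be viewed through the Spark history server, provided event logging is enabled; a minimal sketch (the log directory and application details are placeholders):

    # Sketch: persist event logs so finished applications remain visible
    # in the history server UI. The HDFS directory is a placeholder.
    spark-submit \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=hdfs:///spark-events \
      --class com.example.MyApp \
      --master yarn \
      my-app.jar

    # Point the history server at the same directory
    # (spark.history.fs.logDirectory in spark-defaults.conf) and start it:
    $SPARK_HOME/sbin/start-history-server.sh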

Upvotes: 10
