Reputation: 783
I would like to profile my Spark Scala applications to figure out which parts of the code I have to optimize. I enabled -Xprof
in --driver-java-options
but this is not of much help to me, as it gives a lot of granular detail. I am just interested in knowing how much time each function call in my application takes.
As in other Stack Overflow questions, many people suggested YourKit, but it is not inexpensive. So I would like to use something that is not costly, in fact free of cost.
Are there any better ways to solve this?
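For context, this is roughly how the flag was passed (a sketch; the class name and jar are placeholders, not from the original question):

```shell
# Hypothetical invocation: -Xprof handed to the driver JVM via --driver-java-options
spark-submit \
  --class com.example.MyApp \
  --driver-java-options "-Xprof" \
  my-app.jar
```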
Upvotes: 15
Views: 16685
Reputation: 325
I would suggest checking out sparklens, a profiling and performance prediction tool for Spark with a built-in Spark Scheduler simulator. It gives an overall idea of how efficiently your cluster resources are utilized and what effect (approximately) a change in cluster resource configuration could have on performance.
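A minimal sketch of attaching sparklens to a job, assuming you pull it from spark-packages (the package version, class name, and jar are placeholders; check the sparklens README for current coordinates):

```shell
# sparklens hooks in as a SparkListener; no code changes needed
spark-submit \
  --packages qubole:sparklens:0.3.2-s_2.11 \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  --class com.example.MyApp \
  my-app.jar
```

After the job finishes, sparklens prints its efficiency report and simulation results to the driver log.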
Upvotes: 1
Reputation: 88
Look at JVM Profiler, released by Uber.
JVM Profiler is a tool developed by Uber for analysing JVM applications in distributed environments. It can attach a Java agent to the executors of a Spark/Hadoop application in a distributed way and collect various metrics at runtime. It allows tracing arbitrary Java methods/arguments without source code changes (similar to DTrace).
Here is the blog post.
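A sketch of how the agent is attached, assuming the profiler jar has been built and shipped with the job (the jar name, version, and class name are placeholders; see the JVM Profiler README for the exact options):

```shell
# Attach the profiler agent to driver and executors via extraJavaOptions
spark-submit \
  --jars jvm-profiler-1.0.0.jar \
  --conf "spark.driver.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar" \
  --conf "spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar" \
  --class com.example.MyApp \
  my-app.jar
```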
Upvotes: 3
Reputation: 37009
I've recently written an article and a script that wraps spark-submit and generates a flame graph after executing a Spark application.
Here's the article: https://www.linkedin.com/pulse/profiling-spark-applications-one-click-michael-spector
Here's the script: https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph
Just use it instead of regular spark-submit.
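For example, a drop-in invocation might look like this (the class name and jar are placeholders; the script accepts the same arguments as spark-submit):

```shell
# Same arguments as spark-submit; a flame graph is produced when the job completes
./spark-submit-flamegraph \
  --class com.example.MyApp \
  my-app.jar
```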
Upvotes: 7
Reputation: 171
As you said, profiling a distributed process is trickier than profiling a single JVM process, but there are ways to achieve this.
You can use sampling as a thread-profiling method: add a Java agent to the executors that captures stack traces, then aggregate over these stack traces to see which methods your application spends the most time in.
For example, you can use Etsy's statsd-jvm-profiler Java agent, configure it to send the stack traces to InfluxDB, and then aggregate them into flame graphs.
For more information, check out my post on profiling Spark applications: https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/
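A sketch of wiring up the agent on the executors, assuming an InfluxDB instance is reachable from them (the jar version, host, credentials, and class name are placeholders; the reporter options come from the statsd-jvm-profiler README):

```shell
# Ship the agent jar with the job and attach it to every executor JVM
spark-submit \
  --jars statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar \
  --conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=influxdb.example.com,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler" \
  --class com.example.MyApp \
  my-app.jar
```

The collected stack traces can then be exported from InfluxDB and rendered with the standard FlameGraph scripts.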
Upvotes: 8
Reputation: 6915
I would recommend using the UI that Spark itself provides. It offers a lot of information and metrics regarding time, steps, network usage, etc.
You can check more about it here: https://spark.apache.org/docs/latest/monitoring.html
Also, in newer Spark versions (1.4.0 and later) there is a nice visualization for understanding the steps and stages of your Spark jobs.
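If you want the UI to remain available after the application finishes, you can persist the event log so the history server can replay it (a sketch; the log directory and class name are placeholders, and these are the standard spark.eventLog.* configuration keys):

```shell
# Write the event log so completed jobs show up in the Spark history server
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-logs \
  --class com.example.MyApp \
  my-app.jar
```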
Upvotes: 10