YAKOVM

Reputation: 10153

How to measure the execution time of a query on Spark

I need to measure the execution time of a query on Apache Spark (Bluemix). What I tried:

import time

startTimeQuery = time.clock()
df = sqlContext.sql(query)
df.show()
endTimeQuery = time.clock()
runTimeQuery = endTimeQuery - startTimeQuery

Is this a good way? The time I get looks too small compared to how long it takes before the table actually appears.

Upvotes: 16

Views: 59519

Answers (7)

Guy

Reputation: 154

You can also try using sparkMeasure, which simplifies the collection of performance metrics.
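
A minimal PySpark sketch, assuming the sparkmeasure Python package is installed (pip install sparkmeasure) together with its companion jar on the Spark classpath, and using the StageMetrics API from its documentation; details may differ in your setup:

from sparkmeasure import StageMetrics

# Collect stage-level metrics (including elapsed time) around the query
stagemetrics = StageMetrics(spark)
stagemetrics.begin()
spark.sql(query).show()
stagemetrics.end()
stagemetrics.print_report()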

Upvotes: 0

Amir Charkhi

Reputation: 836

If you are using spark-shell (Scala), you can time a query with the spark.time method:

val df = sqlContext.sql(query)
spark.time(df.show())

However, SparkSession.time() is not available in PySpark. For Python, a simple solution is to use the time module:

import time
start_time = time.time()
df.show()
print(f"Execution time: {time.time() - start_time}")

Upvotes: 2

Mehdi LAMRANI

Reputation: 11597

For those looking for / needing a Python version (since a PySpark Google search leads to this post):

from time import time
from datetime import timedelta

class T:
    def __enter__(self):
        # record the start time when the block is entered
        self.start = time()
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        # compute and print the elapsed wall-clock time on exit
        self.end = time()
        elapsed = self.end - self.start
        print(str(timedelta(seconds=elapsed)))

Usage :

with T():
    df = sqlContext.sql(query)  # Spark code goes here
    df.show()

Inspired by: https://blog.usejournal.com/how-to-create-your-own-timing-context-manager-in-python-a0e944b48cf8

Proved useful when working in the console or in notebooks (the Jupyter magics %%time and %timeit are limited to cell scope, which is inconvenient when you have objects shared across the notebook context).

Upvotes: -2

Tyrone321

Reputation: 1902

To do this in spark-shell (Scala), you can use spark.time().

See another answer of mine: https://stackoverflow.com/a/50289329/3397114

val df = sqlContext.sql(query)
spark.time(df.show())

The output would be:

+----+----+
|col1|col2|
+----+----+
|val1|val2|
+----+----+
Time taken: xxx ms

Related: On Measuring Apache Spark Workload Metrics for Performance Troubleshooting.

Upvotes: 25

Sven Hafeneger

Reputation: 801

Update: No, using the time package is not the best way to measure the execution time of Spark jobs. The most convenient and exact way I know of is to use the Spark History Server.

On Bluemix, in your notebooks go to the "Palette" on the right side. Choose the "Environment" panel and you will see a link to the Spark History Server, where you can investigate the performed Spark jobs, including computation times.

Upvotes: 7

shridharama

Reputation: 979

I use System.nanoTime inside a helper function, like this:

def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}

time {
  val df = sqlContext.sql(query)
  df.show()
}

Upvotes: 15

Sumit

Reputation: 1420

Spark itself provides granular information about each stage of your Spark job.

You can view your running job at http://IP-MasterNode:4040, or you can enable the History Server to analyze jobs at a later time.

Refer here for more info on the History Server.
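
As a sketch, event logging (which feeds the History Server) can be turned on when building the session in PySpark; the appName and log directory below are assumptions, and the directory must exist and match spark.history.fs.logDirectory on the History Server side:

from pyspark.sql import SparkSession

# Write event logs so finished jobs can be inspected in the History Server later.
# The directory is only an example; point it at a path the History Server also reads.
spark = (SparkSession.builder
         .appName("query-timing")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "file:///tmp/spark-events")
         .getOrCreate())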

Upvotes: 4
