Reputation: 10153
I need to measure the execution time of a query on Apache Spark (Bluemix). What I tried:
import time
startTimeQuery = time.clock()
df = sqlContext.sql(query)
df.show()
endTimeQuery = time.clock()
runTimeQuery = endTimeQuery - startTimeQuery
Is this a good way to do it? The time I get looks too small compared to how long it actually takes before I see the table.
Upvotes: 16
Views: 59519
Reputation: 154
You can also try using sparkMeasure, which simplifies the collection of performance metrics.
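For reference, here is a minimal PySpark sketch of how sparkMeasure's StageMetrics API can be used; this assumes the sparkmeasure Python package and its companion spark-measure JAR are available on the cluster, and that spark and query are the SparkSession and SQL string from the question:

from sparkmeasure import StageMetrics

# Collect stage-level metrics (including elapsed time) around the measured code.
stagemetrics = StageMetrics(spark)

stagemetrics.begin()
spark.sql(query).show()
stagemetrics.end()

# Prints elapsed time plus aggregated task/executor metrics for the measured block.
stagemetrics.print_report()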
Upvotes: 0
Reputation: 836
If you are using spark-shell (Scala), you can use spark.time():

val df = sqlContext.sql(query)
spark.time(df.show())

However, SparkSession.time() is not available in PySpark. For Python, a simple solution is to use the time module:
import time
start_time = time.time()
df.show()
print(f"Execution time: {time.time() - start_time}")
Upvotes: 2
Reputation: 11597
For those looking for / needing a Python version (since a pyspark Google search leads to this post):
from time import time
from datetime import timedelta

class T():
    def __enter__(self):
        self.start = time()

    def __exit__(self, type, value, traceback):
        self.end = time()
        elapsed = self.end - self.start
        print(str(timedelta(seconds=elapsed)))
Usage:

with T():
    # spark code goes here
Inspired by: https://blog.usejournal.com/how-to-create-your-own-timing-context-manager-in-python-a0e944b48cf8
This proved useful when working in the console or with notebooks (the Jupyter magics %%time and %timeit are limited to cell scope, which is inconvenient when you have shared objects across the notebook context).
Upvotes: -2
Reputation: 1902
To do it in a spark-shell (Scala), you can use spark.time().
See another answer of mine: https://stackoverflow.com/a/50289329/3397114

val df = sqlContext.sql(query)
spark.time(df.show())
The output would be:
+----+----+
|col1|col2|
+----+----+
|val1|val2|
+----+----+
Time taken: xxx ms
Related: On Measuring Apache Spark Workload Metrics for Performance Troubleshooting.
Upvotes: 25
Reputation: 801
Update:
No, using the time package is not the best way to measure the execution time of Spark jobs. The most convenient and exact way I know of is to use the Spark History Server.
On Bluemix, in your notebooks go to the "Palette" on the right side, choose the "Environment" panel, and you will see a link to the Spark History Server, where you can investigate the performed Spark jobs, including computation times.
Upvotes: 7
Reputation: 979
I use System.nanoTime wrapped in a helper function, like this:
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}
time {
  val df = sqlContext.sql(query)
  df.show()
}
Upvotes: 15
Reputation: 1420
Spark itself provides granular information about each stage of your Spark job.
You can view your running job at http://IP-MasterNode:4040, or you can enable the History Server to analyze the jobs at a later time.
Refer here for more info on the History Server.
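If you manage your own cluster (rather than Bluemix), event logging has to be switched on for the History Server to show finished jobs. A minimal PySpark sketch follows; the app name and log directory are assumptions, and the directory must exist and match what the History Server reads via spark.history.fs.logDirectory:

from pyspark.sql import SparkSession

# Enable event logging so completed jobs appear in the Spark History Server.
# file:///tmp/spark-events is an assumed local path; use a shared location
# (HDFS, S3, ...) on a real cluster and create the directory beforehand.
spark = (SparkSession.builder
         .appName("query-timing")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "file:///tmp/spark-events")
         .getOrCreate())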
Upvotes: 4