Reputation: 535
I use a Docker image with a Jupyter/PySpark notebook and run different queries on a huge dataset. I work with RDDs as well as DataFrames, and I would like to analyse the execution time of various queries. These queries might be nested inside some function:
def get_rdd_pair(rdd):
    rdd = rdd.map(lambda x: (x[0], x[1])) \
             .flatMapValues(lambda x: x)
    return rdd
or like so:
df = df.select(df.column1, explode(df.column2))
I hope you get the idea. I am now looking for a way to reliably measure the total execution time. I have tried writing a decorator combined with the time
module. The problem is that this only works for functions like get_rdd_pair(rdd),
and those get called a lot (once per line) if I use something like
rdd = rdd.map(get_rdd_pair)
So this did not work at all. Any ideas? I have heard of SparkMeasure,
but it seems rather involved to get it running with Docker and might not be worth the effort.
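For reference, the timing decorator I tried looks roughly like this (a minimal sketch; the wrapper name and print format are just illustrative):

import time
from functools import wraps

def timed(func):
    # Prints how long the decorated call itself took.
    # Note: for Spark transformations this only measures the time to
    # build the lazy execution plan, not the actual computation.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.3f} s")
        return result
    return wrapper

@timed
def get_rdd_pair(rdd):
    rdd = rdd.map(lambda x: (x[0], x[1])) \
             .flatMapValues(lambda x: x)
    return rdd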
Upvotes: 1
Views: 4376
Reputation: 31
SparkSession.time() is not available in PySpark.
Instead, import time and measure it yourself:
import time
start_time = time.time()
df.show()
print(f"Execution time: {time.time() - start_time}")
Upvotes: 2