kklaw

Reputation: 535

PySpark: analyse the execution time of queries

I use a Docker image with a Jupyter/PySpark notebook and run different queries on a huge dataset. I use RDDs as well as DataFrames and would like to analyse the execution time of various queries. These queries might be nested inside some function

def get_rdd_pair(rdd):

    rdd = rdd.map(lambda x: (x[0], x[1])) \
             .flatMapValues(lambda x: x)

    return rdd

or like so:

df = df.select(df.column1, explode(df.column2))

I hope you get the idea. I am now looking for a way to reliably measure the total execution time. I have tried writing a decorator combined with the time module. The problem is that this only works for functions like get_rdd_pair(rdd), and these are called a lot (once per line) if I use something like

rdd = rdd.map(get_rdd_pair)

So this did not work at all. Any ideas? I have heard of SparkMeasure, but it seems rather involved to get it running with Docker and might not be worth the effort.
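For illustration, a minimal sketch of such a timing decorator (the name timed is a placeholder, not the actual code from the attempt). It shows why the approach falls short: the decorated function only builds a lazy transformation, so the measured time does not include the actual Spark execution.

    import time
    from functools import wraps

    def timed(func):
        # Measures the wall-clock time of a single call to func.
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            print(f"{func.__name__} took {time.time() - start:.4f} s")
            return result
        return wrapper

    @timed
    def get_rdd_pair(rdd):
        # Returns almost instantly: Spark only records the transformation here,
        # it does not execute anything yet.
        return rdd.map(lambda x: (x[0], x[1])).flatMapValues(lambda x: x)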

Upvotes: 1

Views: 4376

Answers (1)

Kyungjin Jung

Reputation: 31

SparkSession.time() is not available in PySpark.

Instead, import the time module and measure it yourself:

import time

start_time = time.time()
df.show()  # an action like show() forces the query to actually run
print(f"Execution time: {time.time() - start_time}")

Upvotes: 2
