Reputation: 535
I use a Docker image with a Jupyter/PySpark notebook and run different queries on a huge dataset. I work with RDDs as well as DataFrames, and I would like to analyse the execution time of various queries. These queries might be nested inside some function:
def get_rdd_pair(rdd):
    rdd = rdd.map(lambda x: (x[0], x[1])) \
             .flatMapValues(lambda x: x)
    return rdd
or like so:
df = df.select(df.column1, explode(df.column2))
I hope you get the idea. I am now looking for a way to reliably measure the total execution time. I have tried writing a decorator combined with the time
module. The problem is that this only works for functions like get_rdd_pair(rdd),
and those get called a lot (once per line) if I use something like
rdd = rdd.map(get_rdd_pair)
So this did not work at all. Any ideas? I have heard of SparkMeasure,
but it seems rather involved to get it running with Docker and might not be worth the effort.
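For reference, the timing decorator I tried looks roughly like this (a minimal sketch; the wrapper name and print format are just illustrative):

import time
from functools import wraps

def timed(func):
    # Prints how long the decorated call itself took.
    # Note: for Spark transformations this only measures the time to
    # build the lazy execution plan, not the actual computation.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.3f} s")
        return result
    return wrapper

@timed
def get_rdd_pair(rdd):
    rdd = rdd.map(lambda x: (x[0], x[1])) \
             .flatMapValues(lambda x: x)
    return rdd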
Upvotes: 1
Views: 4376
Reputation: 31
SparkSession.time() is not available in PySpark.
Instead, import time and measure it yourself:
import time
start_time = time.time()
df.show()
print(f"Execution time: {time.time() - start_time}")
Upvotes: 2