Reputation: 20445
Problem
I have a pyspark job file in which a certain dataframe is read from a parquet file and some filters are applied. These operations are common and I want them to be done only once, but I don't know how to pass the huge dataframe to a function or correctly store it as a global variable.
What I tried:
I have three options in mind, but I'm not sure if they are efficient:
1) Store the dataframe as a global variable (gives reference errors)
2) Pass the dataframe as a function argument
3) Persist/Cache the dataframe till these steps
Code:
def function1():
    df_in_concern = sqlContext.read.parquet(...)
    df_in_concern = df_in_concern.filter(...)
    df_in_concern = df_in_concern.filter(...)

def function2():
    df_in_concern = sqlContext.read.parquet(...)
    df_in_concern = df_in_concern.filter(...)
    df_in_concern = df_in_concern.filter(...)

def main():
    function1()
    function2()

if __name__ == "__main__":
    main()
So, if there is some way to commonly access df_in_concern, it would avoid repeating the heavy reads and joins in the different functions.
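For example, the persist/cache option I'm considering would look roughly like this (the path, filter condition and SparkSession setup are just placeholders, not my real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def build_common_df():
    # read and filter once; path and condition are hypothetical placeholders
    df = spark.read.parquet("/path/to/data.parquet")
    df = df.filter(df.status == "active")
    return df.cache()  # mark for caching so later actions reuse it

def function1(df_in_concern):
    # first action triggers the read/filter and populates the cache
    print(df_in_concern.count())

def function2(df_in_concern):
    # subsequent actions are served from the cached data
    df_in_concern.show(5)

def main():
    df_in_concern = build_common_df()
    function1(df_in_concern)
    function2(df_in_concern)

if __name__ == "__main__":
    main()

But I'm not sure whether passing the dataframe around like this is the efficient/idiomatic way.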
Upvotes: 2
Views: 1762
Reputation: 2646
spark_dataframe.createOrReplaceTempView("tmp_table_name")
is probably your best option. Use it as follows:
def read_table_first_time():
    df1 = spark.createDataFrame([("val",)], ["key"])
    df1.createOrReplaceTempView("df1")

def read_table_again():
    df_ref = spark.table("df1")
    df_ref.show()

read_table_first_time()
read_table_again()
This outputs:
+---+
|key|
+---+
|val|
+---+
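One note: a temp view only registers a name for the dataframe; the read and filters behind it are still lazy, so if the goal is also to avoid hitting the parquet file on every use, combine the view with cache(). A minimal sketch (the path and filter are placeholders):

df1 = spark.read.parquet("/path/to/data.parquet")  # placeholder path
df1 = df1.filter("some_column = 'some_value'")     # placeholder filter
df1.cache()                                        # keep the filtered data in memory after the first action
df1.createOrReplaceTempView("df1")                 # later spark.table("df1") calls reuse the cached data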
Upvotes: 1