A.B

Reputation: 20445

PySpark Python dataframe reuse in different functions

Problem

I have a PySpark job file in which a certain dataframe is read from a parquet file and some filters are applied. These operations are common to several functions, and I want them to be done only once, but I don't know how to pass the huge dataframe into a function or correctly store it as a global variable.

What I tried:

I have three options in mind, but I'm not sure whether they are efficient:

  1. Pass this dataframe to each function
  2. Define this dataframe as empty in main and access/modify it in the other functions (not sure about this, as it gives reference errors)
  3. Persist/cache the dataframe up to these steps

Code:

def function1():
    # reads the parquet file and applies the same filters as function2
    df_in_concern = sqlContext.read.parquet(...)
    df_in_concern = df_in_concern.filter(...)
    df_in_concern = df_in_concern.filter(...)

def function2():
    # duplicates the read and filters from function1
    df_in_concern = sqlContext.read.parquet(...)
    df_in_concern = df_in_concern.filter(...)
    df_in_concern = df_in_concern.filter(...)

def main():
    function1()
    function2()


if __name__ == "__main__":
    main()

So, if there is some way to access df_in_concern from one common place, it would avoid repeating the heavy reads and joins in every function.
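A minimal sketch of what I mean by options 1 and 3 combined: read and filter once in main, cache the result, and pass it to each function (the parquet path, column name, and filter condition below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def build_df_in_concern(path):
    # read and filter once; path and condition are placeholders
    df = spark.read.parquet(path)
    df = df.filter("some_column IS NOT NULL")
    return df

def function1(df_in_concern):
    # works on the already-filtered dataframe passed in
    df_in_concern.groupBy("some_column").count().show()

def function2(df_in_concern):
    df_in_concern.select("some_column").show()

def main():
    # option 1: build once and pass it around
    df_in_concern = build_df_in_concern("/path/to/parquet")
    # option 3: cache so the read/filters are not recomputed on every action
    df_in_concern.cache()
    function1(df_in_concern)
    function2(df_in_concern)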

Upvotes: 2

Views: 1762

Answers (1)

avloss

Reputation: 2646

spark_dataframe.createOrReplaceTempView("tmp_table_name") is probably your best option. Use it as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_table_first_time():
    # build the dataframe once and register it under a name
    df1 = spark.createDataFrame([("val",)], ["key"])
    df1.createOrReplaceTempView("df1")

def read_table_again():
    # look the registered view up by name from any other function
    df_ref = spark.table("df1")
    df_ref.show()

read_table_first_time()
read_table_again()

This outputs:

+---+
|key|
+---+
|val|
+---+
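Note that a temp view only registers a name for the dataframe's logical plan; by itself it doesn't materialize anything, so an expensive read plus filters may still be recomputed each time the view is used. If that's a concern, it can be combined with caching (option 3 in the question), roughly like this (the path and filter are placeholders):

df1 = spark.read.parquet("/path/to/parquet")  # placeholder path
df1 = df1.filter("some_column IS NOT NULL")   # placeholder filter
df1.cache()  # keep the filtered result around after the first action
df1.createOrReplaceTempView("df1")  # other functions can now use spark.table("df1")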

Upvotes: 1
