Reputation: 20445
Problem
I have a pyspark job file in which a certain dataframe is read from a parquet file and some filters are applied. These operations are common and I want them to be done only once, but I don't know how to pass the huge dataframe to a function or correctly store it as a global variable.
What I tried:
I have three options in mind, but I'm not sure if they are efficient:
1) Store the dataframe as a global variable (gives reference errors)
2) Pass the dataframe as a function argument
3) Persist/Cache the dataframe till these steps
Code:
def function1():
    df_in_concern = sqlContext.read.parquet(...)
    df_in_concern = df_in_concern.filter(...)
    df_in_concern = df_in_concern.filter(...)

def function2():
    df_in_concern = sqlContext.read.parquet(...)
    df_in_concern = df_in_concern.filter(...)
    df_in_concern = df_in_concern.filter(...)

def main():
    function1()
    function2()

if __name__ == "__main__":
    main()
So, if there is some way to commonly access df_in_concern, it would avoid repeating the heavy reads and joins in the different functions.
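For example, the persist/cache option I'm considering would look roughly like this (the path, filter condition and SparkSession setup are just placeholders, not my real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def build_common_df():
    # read and filter once; path and condition are hypothetical placeholders
    df = spark.read.parquet("/path/to/data.parquet")
    df = df.filter(df.status == "active")
    return df.cache()  # mark for caching so later actions reuse it

def function1(df_in_concern):
    # first action triggers the read/filter and populates the cache
    print(df_in_concern.count())

def function2(df_in_concern):
    # subsequent actions are served from the cached data
    df_in_concern.show(5)

def main():
    df_in_concern = build_common_df()
    function1(df_in_concern)
    function2(df_in_concern)

if __name__ == "__main__":
    main()

But I'm not sure whether passing the dataframe around like this is the efficient/idiomatic way.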
Upvotes: 2
Views: 1762
Reputation: 2646
spark_dataframe.createOrReplaceTempView("tmp_table_name")
is probably your best option. Use it as follows:
def read_table_first_time():
    df1 = spark.createDataFrame([("val",)], ["key"])
    df1.createOrReplaceTempView("df1")

def read_table_again():
    df_ref = spark.table("df1")
    df_ref.show()

read_table_first_time()
read_table_again()
This outputs:
+---+
|key|
+---+
|val|
+---+
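One note: a temp view only registers a name for the dataframe; the read and filters behind it are still lazy, so if the goal is also to avoid hitting the parquet file on every use, combine the view with cache(). A minimal sketch (the path and filter are placeholders):

df1 = spark.read.parquet("/path/to/data.parquet")  # placeholder path
df1 = df1.filter("some_column = 'some_value'")     # placeholder filter
df1.cache()                                        # keep the filtered data in memory after the first action
df1.createOrReplaceTempView("df1")                 # later spark.table("df1") calls reuse the cached data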
Upvotes: 1