Improve data wrangling performance in Spark.SQL

Question

I have a big database containing several csv files. Each csv file contains the last 10 days and only the oldest date is final data.

For example "file_2019-08-11.csv" file contains data from 08-02 until 08-11 ( only records with date 08-02 in the data are final) and "file_2019-08-12.csv" file contains data from 08-03 until 08-12 ( only records with date 08-03 are final).

I am using PySpark to do that. My aim is to keep only records for date 08-02 from variables_2019-08-11.csv file and records for date 08-03 from variables_2019-08-12.csv file and so on. I am using PySpark and Databricks to do that , my snippet is working but is a bit slow , although I am running it on big enough cluster.

I would gladly take suggestions for other scenarios to improve its performance. Thanks

    import datetime
    # define the period range
    start_date="2019-08-12"
    end_date="2019-08-30



# create list of dates under date_generated variable

    start = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.datetime.strptime(end_date, "%Y-%m-%d")
    date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]

# read first file

    filename="file_variables_"+str(date_generated[0])[0:10]+".csv"
    df=spark.read.csv(data_path+filename,header="true")
    df.createOrReplaceTempView("df")

#create the main file which we will use the other dates to append below this one

    final=spark.sql("select * from df where data_date in (select min(data_date) from df)")

#loop on other dates than the first date 

    for date in date_generated[1:len(date_generated)]:
      filename="file_variables_"+str(date)[0:10]+".csv"
      df=spark.read.csv(data_path+filename,header="true")
      df.createOrReplaceTempView("df")
      temp=spark.sql("select * from df where data_date in (select min(data_date) from df)")
      final=final.union(temp)
    final.createOrReplaceTempView("final")

Improve data wrangling performance in Spark.SQL

Answers (1)

Related Questions