Pyspark: Save dataframe to single csv by date using Window function?

Question

I have a dataframe like this:

df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11"],
                   "Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
                   "Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
                   "Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
                   "Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})

df = spark.createDataFrame(df)

+----------+-----------+-----------+-----------+----+
|      Date|Slot_Length|Total_Space|Amount_Over|Rank|
+----------+-----------+-----------+-----------+----+
|2020-05-10|         30|         60|        -30|   1|
|2020-05-10|         30|         60|        -30|   1|
|2020-05-10|         30|         60|        -30|   2|
|2020-05-11|         30|        120|        -60|   1|
|2020-05-11|         30|        120|        -60|   1|
|2020-05-11|         30|        120|        -60|   1|
|2020-05-11|         30|        120|        -60|   1|
|2020-05-11|         30|        120|        -60|   2|
|2020-05-11|         30|        120|        -60|   2|
+----------+-----------+-----------+-----------+----+

I know I can save a dataframe to a single csv file like this:

df.coalesce(1).write.format("com.databricks.spark.csv"
                                       ).mode('overwrite'
                                             ).option("header", "true"
                                               ).save("s3://mycsv_date.csv")

I would like to break my dataframe out by the date and save each filtered dataframe to csv.

mycsv_2020_05_10.csv
mycsv_2020_05_11.csv

What is the best way to do this?

Shubham Jain · Accepted Answer

Use

df.repartition('Date').write.partitionBy('Date').format("com.databricks.spark.csv"
                                       ).mode('overwrite'
                                             ).option("header", "true"
                                               ).save("s3://bucket/path")

Now you will have each date's folder with single file in each partition

Pyspark: Save dataframe to single csv by date using Window function?

Answers (1)

Related Questions