Reputation: 1949
I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11"],
                   "Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
                   "Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
                   "Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
                   "Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
# Convert to a Spark dataframe (assumes an existing SparkSession named spark)
df = spark.createDataFrame(df)
+----------+-----------+-----------+-----------+----+
|      Date|Slot_Length|Total_Space|Amount_Over|Rank|
+----------+-----------+-----------+-----------+----+
|2020-05-10|         30|         60|        -30|   1|
|2020-05-10|         30|         60|        -30|   1|
|2020-05-10|         30|         60|        -30|   2|
|2020-05-11|         30|        120|        -60|   1|
|2020-05-11|         30|        120|        -60|   1|
|2020-05-11|         30|        120|        -60|   1|
|2020-05-11|         30|        120|        -60|   1|
|2020-05-11|         30|        120|        -60|   2|
|2020-05-11|         30|        120|        -60|   2|
+----------+-----------+-----------+-----------+----+
I know I can save a dataframe to a single csv file like this:
(df.coalesce(1)
   .write.format("com.databricks.spark.csv")
   .mode("overwrite")
   .option("header", "true")
   .save("s3://mycsv_date.csv"))
I would like to split the dataframe by date and save each subset to its own csv file, for example:
mycsv_2020_05_10.csv
mycsv_2020_05_11.csv
What is the best way to do this?
Upvotes: 0
Views: 1306
Reputation: 5536
Repartition on Date and write with partitionBy:
(df.repartition("Date")
   .write.partitionBy("Date")
   .format("com.databricks.spark.csv")
   .mode("overwrite")
   .option("header", "true")
   .save("s3://bucket/path"))
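This creates one folder per date under the output path (Date=2020-05-10, Date=2020-05-11), each containing a single part file. Spark chooses the part file names, and the Date value is stored in the folder name rather than as a column inside the files.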
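Note that partitionBy produces Date=<value> folders with Spark-generated part file names, not the mycsv_2020_05_10.csv names asked for in the question. If those exact names matter, a rename step afterwards is one option. Below is a minimal sketch, assuming the output was written to (or synced down to) a local directory rather than read directly from s3. The helpers target_name and flatten_partitions are hypothetical names for illustration, not part of Spark.

```python
import shutil
from pathlib import Path
from typing import List


def target_name(partition_dir: str, prefix: str = "mycsv") -> str:
    """Map a folder name like 'Date=2020-05-10' to 'mycsv_2020_05_10.csv'."""
    _, _, value = partition_dir.partition("=")
    return f"{prefix}_{value.replace('-', '_')}.csv"


def flatten_partitions(src: str, dst: str, prefix: str = "mycsv") -> List[str]:
    """Copy the single part file of each Date=... folder in src into dst.

    Assumes src and dst are local paths and that each partition folder
    holds exactly one part file (as with repartition('Date') + partitionBy).
    """
    out = Path(dst)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for folder in sorted(Path(src).glob("Date=*")):
        # 'part-*' skips _SUCCESS markers and hidden .crc checksum files
        parts = sorted(folder.glob("part-*"))
        if len(parts) != 1:
            raise ValueError(
                f"expected exactly one part file in {folder}, found {len(parts)}"
            )
        name = target_name(folder.name, prefix)
        shutil.copyfile(parts[0], out / name)
        written.append(name)
    return written
```

For example, flatten_partitions("/tmp/path", "/tmp/flat") would copy each partition's part file to /tmp/flat under the date-based name. Both paths here are placeholders. Since the Date column lives only in the folder name, the copied files will not contain it.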
Upvotes: 1