How to partition a dataframe on multiple columns and write the output to xlsx in Apache Spark

Question

I'm new to scala and I don't know how to ask this kind of question (technical word ...). I have a dataframe :

id     VehicleID         Longitude    Latitude     Date         Distance
1       12311            55.55431     25.45631     01/02/2020    20
2       12311            55.55432     25.45634     01/02/2020    80
3       12311            55.55433     25.45637     02/02/2020    10
4       12311            55.55431     25.45621     02/02/2020    50
5       12309            55.55427     25.45627     01/02/2020    30
6       12309            55.55436     25.45655     02/02/2020    20
7       12412            55.55441     25.45657     01/02/2020    14
8       12412            55.55442     25.45656     02/02/2020    60

i want to save each partition by vehicle_id and date in a different file. For Exampe

 Save tha on a xlsx file named : 12311_01/02/2020

 id     VehicleID         Longitude    Latitude     Date         Distance
 1       12311            55.55431     25.45631     01/02/2020    20
 2       12311            55.55432     25.45634     01/02/2020    80

 Save tha on a xlsx file named : 12311_02/02/2020

 id     VehicleID         Longitude    Latitude     Date         Distance
 3       12311            55.55433     25.45637     02/02/2020    10
 4       12311            55.55431     25.45621     02/02/2020    50

 Save tha on a xlsx file named : 12309_01/02/2020

 id     VehicleID         Longitude    Latitude     Date         Distance
 5       12309            55.55427     25.45627     01/02/2020    30

I managed to save it in a single file, but I cannot save in several files according to partition

df.cache().write.format("com.crealytics.spark.excel").option("header",true).save("test.xlsx")

There is any posibility to do that on spark with scala ?

Thanks .

Giorgos Myrianthous · Accepted Answer

In order to partition your dataframe on two columns, all you have to do is to call partitionBy() in order create more partitions and finally save the file to csv.

The following should do the trick for you:

df \
  .write \
  .partitionBy("VehicleID", "Date")
  .format("com.crealytics.spark.excel") \
  .option("header", "true") \
  .save("test.xlsx") \

Alternatively, if you want the output to be in csv format:

df \
  .write \
  .partitionBy("VehicleID", "Date")
  .format("com.databricks.spark.csv") \
  .option("header", "true") \
  .save("test.csv") \

How to partition a dataframe on multiple columns and write the output to xlsx in Apache Spark

Answers (1)

Related Questions