Reputation: 348
I am trying to figure out how to produce multiple sorted output files from a Dataset in Spark.
Input :-
city1 A1
city2 A2
city1 C1
city2 B2
city1 B1
city2 C2
I want the output first sorted on both columns, and then each city's sorted rows stored in its own file.
output:-
File1:
city1 A1
city1 B1
city1 C1
Similarly, File2 will contain the data of city2.
Upvotes: 0
Views: 105
Reputation: 7996
The obvious way is to use partitionBy. The following code (in Scala) will produce a folder for each city with the required data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._  // enables .toDF on local collections

val df = List(
  ("city1", "A1"),
  ("city2", "A2"),
  ("city1", "C1"),
  ("city2", "B2"),
  ("city1", "B1"),
  ("city2", "C2"))
  .toDF("city", "val")

df.sort("city", "val")
  .withColumn("city-part", col("city"))  // duplicate of "city", used only for partitioning
  .coalesce(1)
  .write
  .partitionBy("city-part")
  .format("csv")
  .save("/output-path")
Note that partitionBy removes the partitioning column from the file contents, so in order to keep the "city" value inside the output files, we add another column (city-part) with the same value to the data frame and partition on that instead.
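Assuming the default layout of Spark's file writer (the exact part-file names will vary per run), the result under /output-path should look roughly like this, with one folder per distinct city-part value and the sorted rows for that city inside:

```text
/output-path/
├── _SUCCESS
├── city-part=city1/
│   └── part-00000-<uuid>.csv    (city1,A1 / city1,B1 / city1,C1)
└── city-part=city2/
    └── part-00000-<uuid>.csv    (city2,A2 / city2,B2 / city2,C2)
```

Because of the coalesce(1) before the write, each partition folder contains a single CSV file rather than one file per task.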
Upvotes: 1