Multiple Sorted Output in Spark Java

Question

I am trying to figure out how to produce multiple sorted outputs using Datasets in Spark.

Input:

city1 A1

city2 A2

city1 C1

city2 B2

city1 B1

city2 C2

I want the output to be sorted on both columns first, and then each city's sorted rows stored in a separate file.

Output:

File1:

city1 A1

city1 B1

city1 C1

Similarly, File2 will contain the data for city2.

Grisha Weintraub · Accepted Answer

The obvious way is to use partitionBy. The following code (in Scala) will produce a folder for each city with the required data.

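import org.apache.spark.sql.functions.col
// toDF needs the implicits of an existing SparkSession;
// here it is assumed to be named `spark`, as in spark-shell
import spark.implicits._

val df = List(
  ("city1","A1"),
  ("city2","A2"),
  ("city1","C1"),
  ("city2","B2"),
  ("city1","B1"),
  ("city2","C2"))
  .toDF("city","val")

df.sort("city", "val")                   // sort on both columns
  .withColumn("city-part", col("city"))  // copy of "city" used only for partitioning
  .coalesce(1)                           // a single writer task, so one file per city
  .write
  .partitionBy("city-part")              // one sub-folder per distinct city
  .format("csv")
  .save("/output-path")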

Note that partitionBy removes the partitioning column from the written files. To keep the "city" column inside the output files, we add another column (city-part) with the same value to the data frame and partition by that column instead.
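
Since the question asks about Java, the same approach might look roughly like this with the Java API. This is only a sketch, assuming Spark 2.x or later with spark-sql on the classpath. The class name SortedOutputPerCity, the helper writeSortedByCity and the local[*] master are made up for illustration, and /output-path is the same placeholder as in the Scala snippet.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

// Hypothetical class name, used only for this illustration.
public class SortedOutputPerCity {

    // Hypothetical helper: writes the sample rows sorted by (city, val),
    // with one sub-folder per city under outputPath.
    static void writeSortedByCity(SparkSession spark, String outputPath) {
        StructType schema = new StructType()
            .add("city", DataTypes.StringType)
            .add("val", DataTypes.StringType);

        // Same sample data as in the question.
        List<Row> rows = Arrays.asList(
            RowFactory.create("city1", "A1"),
            RowFactory.create("city2", "A2"),
            RowFactory.create("city1", "C1"),
            RowFactory.create("city2", "B2"),
            RowFactory.create("city1", "B1"),
            RowFactory.create("city2", "C2"));

        Dataset<Row> df = spark.createDataFrame(rows, schema);

        df.sort("city", "val")                   // sort on both columns
          .withColumn("city-part", col("city"))  // copy of "city" used only for partitioning
          .coalesce(1)                           // a single writer task
          .write()
          .partitionBy("city-part")              // one sub-folder per distinct city
          .format("csv")
          .save(outputPath);                     // by default fails if the path already exists
    }

    public static void main(String[] args) {
        // local[*] is an assumption for trying this out on one machine.
        SparkSession spark = SparkSession.builder()
            .appName("SortedOutputPerCity")
            .master("local[*]")
            .getOrCreate();

        writeSortedByCity(spark, "/output-path");
        spark.stop();
    }
}
```

With the sample data, the output path should end up with two sub-folders, city-part=city1 and city-part=city2, each holding the CSV part file for that city.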
