Raphael Roth

Reputation: 27373

Sort Array of structs in Spark DataFrame

Consider the following dataframe:

import spark.implicits._ // needed for toDF outside spark-shell

case class ArrayElement(id: Long, value: Double)

val df = Seq(
  Seq(
    ArrayElement(1L,-2.0),ArrayElement(2L,1.0),ArrayElement(0L,0.0)
  )
).toDF("arr")

df.printSchema

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = false)
 |    |    |-- value: double (nullable = false)

Is there a way to sort arr by value without using a UDF?

I've seen org.apache.spark.sql.functions.sort_array. What does this method actually do when the array elements are complex types? Does it sort the array by the first field (i.e. id)?

Upvotes: 10

Views: 13730

Answers (1)

Ramesh Maharjan

Reputation: 41957

The Spark functions documentation says: "Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements."

Before I explain, let's look at some examples of what sort_array does.

+----------------------------+----------------------------+
|arr                         |sorted                      |
+----------------------------+----------------------------+
|[[1,-2.0], [2,1.0], [0,0.0]]|[[0,0.0], [1,-2.0], [2,1.0]]|
+----------------------------+----------------------------+

+----------------------------+----------------------------+
|arr                         |sorted                      |
+----------------------------+----------------------------+
|[[0,-2.0], [2,1.0], [0,0.0]]|[[0,-2.0], [0,0.0], [2,1.0]]|
+----------------------------+----------------------------+

+-----------------------------+-----------------------------+
|arr                          |sorted                       |
+-----------------------------+-----------------------------+
|[[0,-2.0], [2,1.0], [-1,0.0]]|[[-1,0.0], [0,-2.0], [2,1.0]]|
+-----------------------------+-----------------------------+
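
For reference, output like the first table above could be produced with a sketch along these lines. It assumes the df from the question and is meant to be pasted into spark-shell (or run as a top-level script). The local session settings and the names sortedDf and sortedPairs are mine, not from the original post, and the exact rendering of show depends on the Spark version.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.sort_array

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("sort-array-demo").getOrCreate()
import spark.implicits._

case class ArrayElement(id: Long, value: Double)

val df = Seq(
  Seq(ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0))
).toDF("arr")

// Add a column holding the array sorted by sort_array (ascending by default).
val sortedDf = df.withColumn("sorted", sort_array($"arr"))
sortedDf.show(false)

// Collect the sorted structs as (id, value) pairs so the order can be inspected.
val sortedPairs: Seq[(Long, Double)] = sortedDf
  .select("sorted")
  .head
  .getSeq[Row](0)
  .map(r => (r.getAs[Long]("id"), r.getAs[Double]("value")))
```

Swapping in the other input arrays from the tables should show the same behaviour.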

So sort_array compares the structs field by field: it orders the elements by the first field (id), uses the second field (value) to break ties, and so on for any further fields, for each array in the given column.

I hope that's clear.
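
Given that field-by-field comparison, one possible way to sort by value without a UDF is sketched below. This goes beyond what I tested above and rests on assumptions: the first variant needs the SQL higher-order function transform (Spark 2.4 or later), and the second needs array_sort with a comparator lambda (Spark 3.0 or later). The names bySwap, byComparator and sortedIds are purely illustrative.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.expr

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("sort-by-value-demo").getOrCreate()
import spark.implicits._

case class ArrayElement(id: Long, value: Double)

val df = Seq(
  Seq(ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0))
).toDF("arr")

// Variant 1 (assumes Spark 2.4+): move `value` to the front so sort_array
// compares it first, then restore the original field order.
val bySwap = df.withColumn("sorted", expr(
  """transform(
    |  sort_array(transform(arr, x -> named_struct('value', x.value, 'id', x.id))),
    |  x -> named_struct('id', x.id, 'value', x.value)
    |)""".stripMargin))

// Variant 2 (assumes Spark 3.0+): array_sort with an explicit comparator on `value`.
val byComparator = df.withColumn("sorted", expr(
  """array_sort(arr, (l, r) ->
    |  CASE WHEN l.value < r.value THEN -1
    |       WHEN l.value > r.value THEN 1
    |       ELSE 0 END)""".stripMargin))

// Helper (hypothetical name): extract the ids of the sorted structs in order.
def sortedIds(d: DataFrame): Seq[Long] =
  d.select("sorted").head.getSeq[Row](0).map(_.getAs[Long]("id"))

val idsBySwap = sortedIds(bySwap)             // values -2.0, 0.0, 1.0 -> ids 1, 0, 2
val idsByComparator = sortedIds(byComparator) // same order for this input
```

The two variants differ on ties: the first falls back to id when two values are equal, while the second treats such elements as equal and leaves their relative order to the sort implementation.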

Upvotes: 10
