Reputation: 27373
Consider the following dataframe:
case class ArrayElement(id: Long, value: Double)
val df = Seq(
  Seq(
    ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0)
  )
).toDF("arr")
df.printSchema
root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = false)
 |    |    |-- value: double (nullable = false)
Is there a way to sort arr by value other than using a UDF?
I've seen org.apache.spark.sql.functions.sort_array. What does this method actually do for complex array elements? Does it sort the array by the first field (i.e. id)?
Upvotes: 10
Views: 13730
Reputation: 41957
The Spark functions documentation says: "Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements."
Before I explain, let's look at some examples of what sort_array does.
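For reference, here is a minimal sketch of how such an example can be reproduced. It assumes a local SparkSession (in spark-shell, spark already exists and the builder line can be dropped); the column name sorted and the app name are my own choices for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sort_array

// Assumption: a local session, used only to make this sketch self-contained.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sort-array-demo") // hypothetical app name
  .getOrCreate()
import spark.implicits._

case class ArrayElement(id: Long, value: Double)

// Same data as in the question: one row holding an array of structs.
val df = Seq(
  Seq(ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0))
).toDF("arr")

// sort_array sorts in ascending order by default.
val sorted = df.withColumn("sorted", sort_array($"arr"))
sorted.show(false)
```

Changing the id values in the input gives the other variations.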
+----------------------------+----------------------------+
|arr                         |sorted                      |
+----------------------------+----------------------------+
|[[1,-2.0], [2,1.0], [0,0.0]]|[[0,0.0], [1,-2.0], [2,1.0]]|
+----------------------------+----------------------------+

+----------------------------+----------------------------+
|arr                         |sorted                      |
+----------------------------+----------------------------+
|[[0,-2.0], [2,1.0], [0,0.0]]|[[0,-2.0], [0,0.0], [2,1.0]]|
+----------------------------+----------------------------+

+-----------------------------+-----------------------------+
|arr                          |sorted                       |
+-----------------------------+-----------------------------+
|[[0,-2.0], [2,1.0], [-1,0.0]]|[[-1,0.0], [0,-2.0], [2,1.0]]|
+-----------------------------+-----------------------------+
So sort_array compares the struct elements of the array field by field: it orders by the first field (id here), breaks ties on the second field (value), and so on for any further fields.
I hope that's clear.
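Given that ordering, sorting by value without a UDF might look like one of the two sketches below. Both are assumptions on my side rather than something from the documentation quoted above: option A relies on the SQL transform function (Spark 2.4+, as far as I know) to put value first in the struct, and option B relies on array_sort accepting a comparator lambda (Spark 3.0+). The names byValueA and byValueB are just for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session; in spark-shell `spark` is already defined.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class ArrayElement(id: Long, value: Double)

val df = Seq(
  Seq(ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0))
).toDF("arr")

// Option A (assumes Spark 2.4+): rebuild each struct with value as the
// first field, so the field-by-field ordering sorts by value first.
// Note that the resulting structs have the field order (value, id).
val byValueA = df.withColumn(
  "sorted",
  expr("sort_array(transform(arr, x -> named_struct('value', x.value, 'id', x.id)))")
)

// Option B (assumes Spark 3.0+): keep the struct layout and pass an
// explicit comparator on value to array_sort.
val byValueB = df.withColumn(
  "sorted",
  expr(
    """array_sort(arr, (l, r) ->
      |  CASE WHEN l.value < r.value THEN -1
      |       WHEN l.value > r.value THEN 1
      |       ELSE 0 END)""".stripMargin
  )
)

byValueA.show(false)
byValueB.show(false)
```

Option A changes the field order inside the struct, so it suits cases where the layout of the result does not matter; option B leaves the elements untouched.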
Upvotes: 10