Reputation: 153
This is a dummy sample of my dataframe
data = [
[3273, "city y", [["ids", 27], ["smf", 13], ["tlk", 35], ["thr", 24]]],
[3213, "city x", [["smf", 23], ["tlk", 15], ["ids", 17], ["thr", 34]]],
]
df = spark.createDataFrame(
data, "city_id:long, city_name:string, cel:array<struct<carr:string, subs:int>>"
)
df.show(2, False)
+-------+---------+--------------------------------------------+
|city_id|city_name|cel |
+-------+---------+--------------------------------------------+
|3273 |city y |[[ids, 27], [smf, 13], [tlk, 35], [thr, 24]]|
|3213 |city x |[[smf, 23], [tlk, 15], [ids, 17], [thr, 34]]|
+-------+---------+--------------------------------------------+
I need to descendingly sort the array of column cel based on its subs value. It would be like this
+-------+---------+--------------------------------------------+
|city_id|city_name|cel |
+-------+---------+--------------------------------------------+
|3273 |city y |[[tlk, 35], [ids, 27], [thr, 24], [smf, 13]]|
|3213 |city x |[[thr, 34], [smf, 23], [ids, 17], [tlk, 15]]|
+-------+---------+--------------------------------------------+
Is there a way of doing it without using an UDF if possible? thanks
I am using spark version 2.4.0
Upvotes: 7
Views: 10684
Reputation: 15258
You can do it using some SQL lambda functions :
df = df.withColumn(
"cel",
F.expr(
"reverse(array_sort(transform(cel,x->struct(x['subs'] as subs,x['carr'] as carr))))"
),
)
df.show()
+-------+---------+--------------------------------------------+
|city_id|city_name|cel |
+-------+---------+--------------------------------------------+
|3273 |city y |[[35, tlk], [27, ids], [24, thr], [13, smf]]|
|3213 |city x |[[34, thr], [23, smf], [17, ids], [15, tlk]]|
+-------+---------+--------------------------------------------+
df.printSchema()
root
|-- city_id: long (nullable = true)
|-- city_name: string (nullable = true)
|-- cel: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- subs: integer (nullable = true)
| | |-- carr: string (nullable = true)
Upvotes: 7
Reputation: 5487
You can use sort_array() to sort an array column. But in case of array<struct>
column this will sort the first column. So we can swap the columns using transform function before using sort_array().
>>> df.withColumn('cel', expr('sort_array(transform(cel, c-> struct(c.subs, c.carr)), False)')).show(truncate=False)
+-------+---------+--------------------------------------------+
|city_id|city_name|cel |
+-------+---------+--------------------------------------------+
|3273 |city y |[{35, tlk}, {27, ids}, {24, thr}, {13, smf}]|
|3213 |city x |[{34, thr}, {23, smf}, {17, ids}, {15, tlk}]|
+-------+---------+--------------------------------------------+
Upvotes: 5