Wanda
Wanda

Reputation: 153

PySpark sorting array of struct

This is a dummy sample of my dataframe

data = [
    [3273, "city y", [["ids", 27], ["smf", 13], ["tlk", 35], ["thr", 24]]],
    [3213, "city x", [["smf", 23], ["tlk", 15], ["ids", 17], ["thr", 34]]],
]
df = spark.createDataFrame(
    data, "city_id:long, city_name:string, cel:array<struct<carr:string, subs:int>>"
)
df.show(2, False)

+-------+---------+--------------------------------------------+
|city_id|city_name|cel                                         |
+-------+---------+--------------------------------------------+
|3273   |city y   |[[ids, 27], [smf, 13], [tlk, 35], [thr, 24]]|
|3213   |city x   |[[smf, 23], [tlk, 15], [ids, 17], [thr, 34]]|
+-------+---------+--------------------------------------------+

I need to descendingly sort the array of column cel based on its subs value. It would be like this

+-------+---------+--------------------------------------------+
|city_id|city_name|cel                                         |
+-------+---------+--------------------------------------------+
|3273   |city y   |[[tlk, 35], [ids, 27], [thr, 24], [smf, 13]]|
|3213   |city x   |[[thr, 34], [smf, 23], [ids, 17], [tlk, 15]]|
+-------+---------+--------------------------------------------+

Is there a way of doing it without using an UDF if possible? thanks

I am using spark version 2.4.0

Upvotes: 7

Views: 10684

Answers (2)

Steven
Steven

Reputation: 15258

You can do it using some SQL lambda functions :

df = df.withColumn(
    "cel",
    F.expr(
        "reverse(array_sort(transform(cel,x->struct(x['subs'] as subs,x['carr'] as carr))))"
    ),
)

df.show()
+-------+---------+--------------------------------------------+
|city_id|city_name|cel                                         |
+-------+---------+--------------------------------------------+
|3273   |city y   |[[35, tlk], [27, ids], [24, thr], [13, smf]]|
|3213   |city x   |[[34, thr], [23, smf], [17, ids], [15, tlk]]|
+-------+---------+--------------------------------------------+

df.printSchema()
root
 |-- city_id: long (nullable = true)
 |-- city_name: string (nullable = true)
 |-- cel: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- subs: integer (nullable = true)
 |    |    |-- carr: string (nullable = true)

Upvotes: 7

Mohana B C
Mohana B C

Reputation: 5487

You can use sort_array() to sort an array column. But in case of array<struct> column this will sort the first column. So we can swap the columns using transform function before using sort_array().

>>> df.withColumn('cel', expr('sort_array(transform(cel, c-> struct(c.subs, c.carr)), False)')).show(truncate=False)
+-------+---------+--------------------------------------------+
|city_id|city_name|cel                                         |
+-------+---------+--------------------------------------------+
|3273   |city y   |[{35, tlk}, {27, ids}, {24, thr}, {13, smf}]|
|3213   |city x   |[{34, thr}, {23, smf}, {17, ids}, {15, tlk}]|
+-------+---------+--------------------------------------------+

Upvotes: 5

Related Questions