Zsolt Pentek

Reputation: 25

pyspark sort array by its inner arrays' value

I have the following df:

+----+------------------------------------------------------------------------------------------+
|  id|                                                                                   id_info|
+----+------------------------------------------------------------------------------------------+
|id_1|[[1, 8, 2, "bar"], [5, 9, 2, "foo"], [4, 3, 2, "something"], [9, null, 2, "this_is_null"]]|
+----+------------------------------------------------------------------------------------------+

I would like the inner arrays sorted by their second element in descending order, so:

+----+------------------------------------------------------------------------------------------+
|  id|                                                                                   id_info|
+----+------------------------------------------------------------------------------------------+
|id_1|[[5, 9, 2, "foo"], [1, 8, 2, "bar"], [4, 3, 2, "something"], [9, null, 2, "this_is_null"]]|
+----+------------------------------------------------------------------------------------------+

I came up with something like this:

from pyspark.sql import functions as F, types as T

def def_sort(x):
    return sorted(x, key=lambda x: x[1], reverse=True)

udf_sort = F.udf(def_sort, T.ArrayType(T.ArrayType(T.IntegerType())))
df.select("id", udf_sort("id_info"))

I'm not sure how to handle null values this way. Also, is there maybe a built-in function for this? Can I somehow do it with F.array_sort?

Upvotes: 1

Views: 1883

Answers (2)

werner

Reputation: 14845

The elements of the array contain integers and a string, so I assume that the column id_info is an array of structs.

So the schema of the input data would be similar to

root
 |-- id: string (nullable = true)
 |-- id_info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col1: integer (nullable = true)
 |    |    |-- col2: integer (nullable = true)
 |    |    |-- col3: integer (nullable = true)
 |    |    |-- col4: string (nullable = true)

The names of the elements of the struct might be different.
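
For reference, a DataFrame with this shape can be built from plain Python tuples plus a DDL schema string (a sketch; the field names col1 ... col4 are placeholders, as noted above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tuples map onto the struct fields; None becomes a null col2.
df = spark.createDataFrame(
    [("id_1", [(1, 8, 2, "bar"), (5, 9, 2, "foo"),
               (4, 3, 2, "something"), (9, None, 2, "this_is_null")])],
    "id string, id_info array<struct<col1:int, col2:int, col3:int, col4:string>>",
)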

With this schema information we can use array_sort to order the array:

df.selectExpr(
    "array_sort(id_info, (l, r) -> "
    "case when l['col2'] > r['col2'] then -1 else 1 end) as sorted"
).show(truncate=False)

prints

+----------------------------------------------------------------------------------+
|sorted                                                                            |
+----------------------------------------------------------------------------------+
|[{5, 9, 2, foo}, {1, 8, 2, bar}, {4, 3, 2, something}, {9, null, 2, this_is_null}]|
+----------------------------------------------------------------------------------+
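
Note that array_sort with a comparator lambda requires Spark 3.0+. If you prefer to stay in the DataFrame API, the same expression can be wrapped in F.expr. Also, since any comparison against a null col2 falls through to the else branch, an explicit null check makes the "nulls last" ordering deliberate rather than incidental (a sketch, reusing the assumed col2 field name):

from pyspark.sql import functions as F

sorted_df = df.select(
    "id",
    F.expr(
        "array_sort(id_info, (l, r) -> "
        "case when l['col2'] is null then 1 "
        "when r['col2'] is null then -1 "
        "when l['col2'] > r['col2'] then -1 else 1 end)"
    ).alias("id_info"),
)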

Upvotes: 3

anky

Reputation: 75080

You can try explode followed by orderBy on id and on the second element in descending order, then groupBy + collect_list:

from pyspark.sql import functions as F

out = (sdf.select("*", F.explode("id_info").alias("element"))
          .withColumn("second_ele", F.element_at("element", 2))
          .orderBy("id", F.desc("second_ele"))
          .groupBy("id").agg(F.collect_list("element").alias("id_info"))
       )

out.show(truncate=False)

+----+-----------------------------------------------------------------------+
|id  |id_info                                                                |
+----+-----------------------------------------------------------------------+
|id_1|[[5, 9, 2, null], [1, 8, 2, null], [4, 3, 2, null], [9, null, 2, null]]|
+----+-----------------------------------------------------------------------+
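
One caveat: Spark does not formally guarantee that collect_list preserves the row order produced by a preceding orderBy. A more defensive variant (a sketch against the same sdf) carries the sort key into the aggregation and sorts the collected array instead, relying on sort_array ordering structs by their first field:

from pyspark.sql import functions as F

out = (sdf.select("id", F.explode("id_info").alias("element"))
          .groupBy("id")
          .agg(F.sort_array(
                   F.collect_list(F.struct(
                       F.element_at("element", 2).alias("key"),
                       F.col("element").alias("value"))),
                   asc=False).alias("kv"))
          # Descending struct order sorts by "key" first and, with these
          # rows, leaves the null key last; keep only the elements.
          .select("id", F.col("kv.value").alias("id_info"))
       )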

Upvotes: 3
