Reputation: 25
I have the following df:
+--------------------+
| id| id_info|
+--------------------+
|id_1| [[1, 8, 2, "bar"], [5, 9, 2, "foo"], [4, 3, 2, "something"], [9, null, 2, "this_is_null"]] |
I would like this sorted by the second element in descending order, so:
+--------------------+
| id| id_info|
+--------------------+
|id_1| [[5, 9, 2, "foo"], [1, 8, 2, "bar"], [4, 3, 2, "something"], [9, null, 2, "this_is_null"]] |
I came up with something like this :
def def_sort(x):
return sorted(x, key=lambda x:x[1], reverse=True)
udf_sort = F.udf(def_sort, T.ArrayType(T.ArrayType(T.IntegerType())))
df.select("id", udf_sort("id_info"))
I'm not sure how to handle null values like this, also is there maybe a built-in function for this? Can I somehow do it with F.array_sort
?
Upvotes: 1
Views: 1883
Reputation: 14845
The elements of the array contain integers and a string, so I assume that the column id_info
is an array of structs.
So the schema of the input data would be similiar to
root
|-- id: string (nullable = true)
|-- id_info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col1: integer (nullable = true)
| | |-- col2: integer (nullable = true)
| | |-- col3: integer (nullable = true)
| | |-- col4: string (nullable = true)
The names of the elements of the struct might be different.
With this schema information we can use array_sort to order the array:
df.selectExpr("array_sort(id_info, (l,r) -> \
case when l['col2'] > r['col2'] then -1 else 1 end) as sorted") \
.show(truncate=False)
prints
+----------------------------------------------------------------------------------+
|sorted |
+----------------------------------------------------------------------------------+
|[{5, 9, 2, foo}, {1, 8, 2, bar}, {4, 3, 2, something}, {9, null, 2, this_is_null}]|
+----------------------------------------------------------------------------------+
Upvotes: 3
Reputation: 75080
You can try explode folowed by orderby on id and second element on descending order, then groupBy + collect_list:
out = (sdf.select("*",F.explode("id_info").alias("element"))
.withColumn("second_ele",F.element_at("element",2))
.orderBy(*["id",F.desc("second_ele")])
.groupBy("id").agg(F.collect_list("element").alias("id_info"))
)
out.show(truncate=False)
+----+-----------------------------------------------------------------------+
|id |id_info |
+----+-----------------------------------------------------------------------+
|id_1|[[5, 9, 2, null], [1, 8, 2, null], [4, 3, 2, null], [9, null, 2, null]]|
+----+-----------------------------------------------------------------------+
Upvotes: 3