Reputation: 2083
Is there a way to sort the data inside a cell of a dataframe? For example, I have a dataframe with two columns, ColA and ColB, containing the following data:
+----+------+
|ColA|  ColB|
+----+------+
|1ABC|101ATP|
|2BCA|ZER987|
+----+------+
Is there a way to order the characters in each cell of ColB using DataFrame SQL, so the output looks like this:
+----+------+
|ColA|  ColB|
+----+------+
|1ABC|011APT|
|2BCA|789ERZ|
+----+------+
Upvotes: 0
Views: 158
Reputation: 3344
Since Spark 2.4 you can do it using array_sort and array_join. In PySpark the query can look like this:
from pyspark.sql.functions import split, array_sort, array_join

l = [('1ABC', '101ATP'), ('2BCA', 'ZER987')]
df = spark.createDataFrame(l, ['ColA', 'ColB'])

(
    df
    .withColumn('x', split('ColB', ''))              # split into a per-character array (note: splitting on '' adds a trailing empty element)
    .withColumn('sorted', array_sort('x'))           # sort the array elements; the empty element sorts to the front
    .withColumn('joined', array_join('sorted', ''))  # join back into a string (the empty element contributes nothing)
).show()
+----+------+--------------------+--------------------+------+
|ColA| ColB| x| sorted|joined|
+----+------+--------------------+--------------------+------+
|1ABC|101ATP|[1, 0, 1, A, T, P, ]|[, 0, 1, 1, A, P, T]|011APT|
|2BCA|ZER987|[Z, E, R, 9, 8, 7, ]|[, 7, 8, 9, E, R, Z]|789ERZ|
+----+------+--------------------+--------------------+------+
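Since the question asks for DataFrame SQL: the same functions are also available as SQL expressions (Spark 2.4+). A minimal sketch, assuming the dataframe above is registered under the illustrative view name t:

df.createOrReplaceTempView('t')
spark.sql(
    "SELECT ColA, array_join(array_sort(split(ColB, '')), '') AS ColB FROM t"
).show()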
Upvotes: 2
Reputation: 1724
You could achieve it by using a udf:
import spark.implicits._

val ds = Seq(
  ("1ABC", "101ATP"),
  ("2BCA", "ZER987")
).toDF("col_a", "col_b")

// UDF that sorts the characters of a string
val sortUdf = spark.udf.register("sort", (value: String) => value.sorted)

ds.select(sortUdf($"col_a").as("col_a"), sortUdf($"col_b").as("col_b"))
  .show(false)
+-----+------+
|col_a|col_b |
+-----+------+
|1ABC |011APT|
|2ABC |789ERZ|
+-----+------+
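For reference, a minimal PySpark equivalent of the same UDF approach (a sketch, assuming an active SparkSession and the dataframe from the question; sort_chars is just an illustrative name):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# UDF that sorts the characters of a string, mirroring value.sorted above
sort_chars = udf(lambda s: ''.join(sorted(s)), StringType())

df.select(sort_chars('ColA').alias('ColA'), sort_chars('ColB').alias('ColB')).show()

Note that the built-in array_sort route in the other answer is generally preferable where available, since UDFs are opaque to the Catalyst optimizer and, in PySpark, add Python serialization overhead.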
Upvotes: 2