Metadata

Reputation: 2083

How to sort data in a cell of a dataframe?

Is there a way to sort the data inside a cell of a dataframe? For example, I have a dataframe with two columns, ColA and ColB, containing the following data:

+------+--------+
| ColA | ColB   |
+------+--------+
| 1ABC | 101ATP |
| 2BCA | ZER987 |
+------+--------+

Is there a way to sort the characters within each cell of ColB using dataframe SQL, so that the output looks like this:

+------+--------+
| ColA | ColB   |
+------+--------+
| 1ABC | 011APT |
| 2BCA | 789ERZ |
+------+--------+

Upvotes: 0

Views: 158

Answers (2)

David Vrba

Reputation: 3344

Since Spark 2.4 you can do this by splitting the string into an array of characters, sorting it with array_sort, and gluing it back together with array_join. In PySpark the query can look like this:

from pyspark.sql.functions import split, array_sort, array_join

l = [('1ABC', '101ATP'), ('2BCA', 'ZER987')]
df = spark.createDataFrame(l, ['ColA', 'ColB'])

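(
  df
  # split on the empty pattern: one array element per character
  .withColumn('x', split('ColB', ''))
  # sort the characters in ascending order
  .withColumn('sorted', array_sort('x'))
  # concatenate back into a single string (empty elements add nothing)
  .withColumn('joined', array_join('sorted', ''))
).show()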

+----+------+--------------------+--------------------+------+
|ColA|  ColB|                   x|              sorted|joined|
+----+------+--------------------+--------------------+------+
|1ABC|101ATP|[1, 0, 1, A, T, P, ]|[, 0, 1, 1, A, P, T]|011APT|
|2BCA|ZER987|[Z, E, R, 9, 8, 7, ]|[, 7, 8, 9, E, R, Z]|789ERZ|
+----+------+--------------------+--------------------+------+
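Since the question asks about SQL specifically: the same three functions are also available inside a Spark SQL expression, so the chain above can probably be written as a single query. This is only a sketch, not tested against a particular Spark version. The view name `cells` and the helper `sort_colb_with_sql` are made up for illustration; `spark` and `df` are the session and dataframe from the snippet above.

```python
# Hypothetical temp view name; any unused name would do.
VIEW_NAME = "cells"

# split -> array_sort -> array_join, expressed as one SQL expression.
# array_sort and array_join require Spark 2.4 or later.
QUERY = (
    "SELECT ColA, "
    "array_join(array_sort(split(ColB, '')), '') AS ColB "
    "FROM " + VIEW_NAME
)


def sort_colb_with_sql(spark, df):
    """Register df as a temp view and run the query against it.

    Returns a new dataframe in which ColB holds its characters
    in sorted order; ColA is passed through unchanged.
    """
    df.createOrReplaceTempView(VIEW_NAME)
    return spark.sql(QUERY)
```

Calling `sort_colb_with_sql(spark, df).show()` should then give the same values as the `joined` column above, without the intermediate `x` and `sorted` columns.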

Upvotes: 2

Gelerion

Reputation: 1724

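You could achieve this with a UDF. Note that the example below applies it to both columns, which is why col_a comes out sorted as well: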

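// needed outside spark-shell for toDF and the $"..." column syntax
import spark.implicits._

val ds = Seq(
  ("1ABC", "101ATP"),
  ("2BCA", "ZER987")
).toDF("col_a", "col_b")

// register the UDF (also usable from SQL as sort(...)); String.sorted orders the characters
val sortUdf = spark.udf.register("sort", (value: String) => value.sorted)

ds.select(sortUdf($"col_a").as("col_a"), sortUdf($"col_b").as("col_b"))
  .show(false)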

+-----+------+
|col_a|col_b |
+-----+------+
|1ABC |011APT|
|2ABC |789ERZ|
+-----+------+
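For anyone on PySpark rather than Scala, a rough equivalent of the UDF might look like the sketch below. The names `sort_chars` and `register_sort_udf` are my own, and the `None` check is an addition that the Scala version above does not have. Registering the function under a name should also make it callable from SQL, which is what the question asked for.

```python
def sort_chars(value):
    """Return the characters of value in ascending code point order."""
    # Assumption: pass nulls through instead of failing on them.
    if value is None:
        return None
    return "".join(sorted(value))


def register_sort_udf(spark):
    """Register sort_chars as a SQL-callable UDF on the given SparkSession.

    spark.udf.register defaults to a string return type for Python
    functions, which is what is needed here.
    """
    return spark.udf.register("sort_chars", sort_chars)
```

After `sort_udf = register_sort_udf(spark)`, something like `df.select('ColA', sort_udf('ColB').alias('ColB'))` or `SELECT ColA, sort_chars(ColB) AS ColB FROM ...` should work. Keep in mind that a Python UDF is generally slower than the built-in functions, so the array_sort approach is likely preferable on Spark 2.4+.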

Upvotes: 2
