Reputation: 17
When I perform a select operation on a DataFrame in PySpark, it reduces to the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 8 | f1b | f2f |
| 9 | f1a | f2d |
| 4 | f1b | f2c |
| 6 | f1b | f2a |
| 1 | f1c | f2c |
| 3 | f1c | f2g |
| 9 | f1c | f2e |
+-----+--------+-------+
I require the val column to be ordered group-wise based on another field, Feat1, like the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 3 | f1a | f2d |
| 1 | f1b | f2c |
| 2 | f1b | f2a |
| 3 | f1b | f2f |
| 1 | f1c | f2c |
| 2 | f1c | f2g |
| 3 | f1c | f2e |
+-----+--------+-------+
NOTE that the val values don't depend on the order of Feat2 but are instead ordered based on their original val values.
Is there a command in PySpark to reorder the column values as required?
NOTE: A question exists for the same problem, but it is specific to SQLite.
Upvotes: 1
Views: 976
Reputation: 1575
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the example DataFrame from the question
data = [(1, 'f1a', 'f2a'),
        (2, 'f1a', 'f2b'),
        (8, 'f1b', 'f2f'),
        (9, 'f1a', 'f2d'),
        (4, 'f1b', 'f2c'),
        (6, 'f1b', 'f2a'),
        (1, 'f1c', 'f2c'),
        (3, 'f1c', 'f2g'),
        (9, 'f1c', 'f2e')]
table = spark.createDataFrame(data, ['val', 'Feat1', 'Feat2'])
Edit: For this purpose, you can use a window with the rank function:
from pyspark.sql import Window
from pyspark.sql.functions import rank

# Rank rows within each Feat1 group by their original val;
# sorting by 'val' as well makes the displayed order deterministic
w = Window.partitionBy('Feat1').orderBy('val')
table.withColumn('val', rank().over(w)).orderBy('Feat1', 'val').show()
+---+-----+-----+
|val|Feat1|Feat2|
+---+-----+-----+
| 1| f1a| f2a|
| 2| f1a| f2b|
| 3| f1a| f2d|
| 1| f1b| f2c|
| 2| f1b| f2a|
| 3| f1b| f2f|
| 1| f1c| f2c|
| 2| f1c| f2g|
| 3| f1c| f2e|
+---+-----+-----+
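If val can contain duplicates within a Feat1 group, rank() repeats a rank for the ties and then skips (e.g. 1, 1, 3). A minimal sketch using row_number() instead, assuming the same window w as above, always gives consecutive numbering:
from pyspark.sql.functions import row_number

# row_number() assigns consecutive values within each group even when
# 'val' has ties, whereas rank() would repeat a rank and leave a gap.
table.withColumn('val', row_number().over(w)).orderBy('Feat1', 'val').show()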
Upvotes: 2