Reputation: 17
When I perform a select operation on a DataFrame in PySpark, it reduces to the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 8 | f1b | f2f |
| 9 | f1a | f2d |
| 4 | f1b | f2c |
| 6 | f1b | f2a |
| 1 | f1c | f2c |
| 3 | f1c | f2g |
| 9 | f1c | f2e |
+-----+--------+-------+
I require the val column to be ordered group-wise based on another field, Feat1, like the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 3 | f1a | f2d |
| 1 | f1b | f2c |
| 2 | f1b | f2a |
| 3 | f1b | f2f |
| 1 | f1c | f2c |
| 2 | f1c | f2g |
| 3 | f1c | f2e |
+-----+--------+-------+
NOTE that the val values don't depend on the order of Feat2 but are instead ordered based on their original val values.
Is there a command in PySpark to reorder the column values as required?
NOTE: A question exists for the same problem, but it is specific to SQLite.
Upvotes: 1
Views: 976
Reputation: 1575
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the example DataFrame from the question
data = [(1, 'f1a', 'f2a'),
        (2, 'f1a', 'f2b'),
        (8, 'f1b', 'f2f'),
        (9, 'f1a', 'f2d'),
        (4, 'f1b', 'f2c'),
        (6, 'f1b', 'f2a'),
        (1, 'f1c', 'f2c'),
        (3, 'f1c', 'f2g'),
        (9, 'f1c', 'f2e')]
table = spark.createDataFrame(data, ['val', 'Feat1', 'Feat2'])
Edit: For this purpose, you can use a window with the rank function:
from pyspark.sql import Window
from pyspark.sql.functions import rank

# Rank rows within each Feat1 group by their original val;
# sorting by 'val' as well makes the displayed order deterministic
w = Window.partitionBy('Feat1').orderBy('val')
table.withColumn('val', rank().over(w)).orderBy('Feat1', 'val').show()
+---+-----+-----+
|val|Feat1|Feat2|
+---+-----+-----+
| 1| f1a| f2a|
| 2| f1a| f2b|
| 3| f1a| f2d|
| 1| f1b| f2c|
| 2| f1b| f2a|
| 3| f1b| f2f|
| 1| f1c| f2c|
| 2| f1c| f2g|
| 3| f1c| f2e|
+---+-----+-----+
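If val can contain duplicates within a Feat1 group, rank() repeats a rank for the ties and then skips (e.g. 1, 1, 3). A minimal sketch using row_number() instead, assuming the same window w as above, always gives consecutive numbering:
from pyspark.sql.functions import row_number

# row_number() assigns consecutive values within each group even when
# 'val' has ties, whereas rank() would repeat a rank and leave a gap.
table.withColumn('val', row_number().over(w)).orderBy('Feat1', 'val').show()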
Upvotes: 2