Reputation: 3043
Let's say I have a list of values:
list_codes = ["code_123", "code_456"]
and there is a PySpark data frame:
+----------+-----------+
| code | value |
+----------+-----------+
| code_456 | value_456 |
| code_123 | value_123 |
+----------+-----------+
I need to subset the data frame so that the rows follow the order of the values in the list:
+----------+-----------+
| code | value |
+----------+-----------+
| code_123 | value_123 |
| code_456 | value_456 |
+----------+-----------+
When I use this command, it keeps the original order of the rows:
df_subset = df.filter(f.col("code").isin(list_codes))
How can I fix this ordering issue (relevant, of course, when the list has more than two values)?
Thanks!
Upvotes: 0
Views: 159
Reputation: 163
I would transform your list of ordering values into a DataFrame by adding a rank column, then do an inner join to filter, and finally order by the rank column. Here is an example:
list_codes = ["code_123", "code_9", "code_456"]
data_codes = [(v, i) for i, v in enumerate(list_codes)]
df_codes = spark.createDataFrame(data=data_codes, schema=["code", "rank"])
df_codes.show()
+--------+----+
| code|rank|
+--------+----+
|code_123| 0|
| code_9| 1|
|code_456| 2|
+--------+----+
And here is the joining + ordering part (you can drop the rank column at the end, of course):
data = [("code_123","value_123"),
("code_456","value_456"),
("code_789","value_789"),
("code_9","value_9"),
]
colums = ["code", "value"]
df = spark.createDataFrame(data=data, schema = colums).join(df_codes, ["code"], "inner") \
.orderBy("rank") \
# .drop("rank")
df.show()
+--------+---------+----+
| code| value|rank|
+--------+---------+----+
|code_123|value_123| 0|
| code_9| value_9| 1|
|code_456|value_456| 2|
+--------+---------+----+
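As an alternative, you can avoid the join entirely and sort by the position of each code in the list. This is a minimal sketch, assuming Spark 2.4+ (where the SQL array_position function is available) and codes that are plain strings with no quotes to escape:
from pyspark.sql import functions as f

list_codes = ["code_123", "code_9", "code_456"]

# Build a SQL expression like:
#   array_position(array('code_123', 'code_9', 'code_456'), code)
# which returns the 1-based position of each row's code in the list.
codes_sql = ", ".join(f"'{c}'" for c in list_codes)
rank_expr = f.expr(f"array_position(array({codes_sql}), code)")

df_ordered = df.filter(f.col("code").isin(list_codes)).orderBy(rank_expr)
df_ordered.show()
The join-based approach scales better for long lists, though, since the ordering list becomes a proper DataFrame instead of one large literal expression.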
Don't forget to mark the answer if it serves your needs :)
Upvotes: 2