Andrii

Reputation: 3043

How to subset data in PySpark according to a list of values

Let's say I have a list of values:

list_codes = ["code_123", "code_456"]

and a PySpark data frame:

+----------+-----------+
| code     |   value   |
+----------+-----------+
| code_456 | value_456 |
| code_123 | value_123 |
+----------+-----------+

I need to subset the data frame so that the rows come out in the same order as the values in the list:

+----------+-----------+
| code     |   value   |
+----------+-----------+
| code_123 | value_123 |
| code_456 | value_456 |
+----------+-----------+

When I use this command, it keeps the original order of the rows:

from pyspark.sql import functions as f
df_subset = df.filter(f.col("code").isin(list_codes))

How can I fix this ordering issue (it matters when the list has more than 2 values, of course)?

Thanks!

Upvotes: 0

Views: 159

Answers (1)

Yassine ELB

Reputation: 163

I would transform your list of ordering values into a DataFrame by adding a rank column, then do an inner join to filter, and finally order by the rank column. Here is an example:

list_codes = ["code_123", "code_9", "code_456"]
data_codes = [(v,i) for i,v in enumerate(list_codes)]
df_codes = spark.createDataFrame(data=data_codes, schema = ["code", "rank"])
df_codes.show()

+--------+----+
|    code|rank|
+--------+----+
|code_123|   0|
|  code_9|   1|
|code_456|   2|
+--------+----+

The join and ordering part would then be (you can drop the rank column at the end, of course):

data = [
    ("code_123", "value_123"),
    ("code_456", "value_456"),
    ("code_789", "value_789"),
    ("code_9", "value_9"),
]
columns = ["code", "value"]
# The inner join keeps only the codes present in df_codes;
# orderBy("rank") then restores the order of list_codes.
df = (
    spark.createDataFrame(data=data, schema=columns)
    .join(df_codes, ["code"], "inner")
    .orderBy("rank")
    # .drop("rank")
)
df.show()

+--------+---------+----+
|    code|    value|rank|
+--------+---------+----+
|code_123|value_123|   0|
|  code_9|  value_9|   1|
|code_456|value_456|   2|
+--------+---------+----+

Don't forget to mark the answer if it serves your needs :).

Upvotes: 2
