MMV

Reputation: 208

How to convert values in a column to an array in an order defined by a list

I have a list of item names like the one below

[hpc_max,asset_min,off_median]

and a table like this:

| item_name | item_value | timestamp                     |
| --------- | ---------- | ---------------------------- |
| hpc_max   | 0.25      | 2023-03-01T17:20:00.000+0000 |
| asset_min | 0.34      | 2023-03-01T17:20:00.000+0000 |
| off_median| 0.30      | 2023-03-01T17:30:00.000+0000 |
| hpc_max   | 0.54      | 2023-03-01T17:30:00.000+0000 |
| asset_min | 0.32      | 2023-03-01T17:35:00.000+0000 |
| off_median| 0.67      | 2023-03-01T17:20:00.000+0000 |
| asset_min | 0.54      | 2023-03-01T17:30:00.000+0000 |
| off_median| 0.32      | 2023-03-01T17:35:00.000+0000 |
| hpc_max   | 0.67      | 2023-03-01T17:35:00.000+0000 |

I want to group the item_value entries by timestamp into arrays, ordered according to the item names in the list.

The output I want is

| item_name                            | item_value       | timestamp                    |
| ------------------------------------ | ---------------- | ---------------------------- |
| ["hpc_max","asset_min","off_median"] | [0.25,0.34,0.67] | 2023-03-01T17:20:00.000+0000 |
| ["hpc_max","asset_min","off_median"] | [0.54,0.54,0.30] | 2023-03-01T17:30:00.000+0000 |
| ["hpc_max","asset_min","off_median"] | [0.67,0.32,0.32] | 2023-03-01T17:35:00.000+0000 |

How can I do this using PySpark?

I would appreciate any help!

Upvotes: 0

Views: 76

Answers (1)

Arud Seka Berne S

Reputation: 953

Your DataFrame (df_1)

+----------+----------+----------------------------+
|item_name |item_value|timestamp                   |
+----------+----------+----------------------------+
|hpc_max   |0.25      |2023-03-01T17:20:00.000+0000|
|asset_min |0.34      |2023-03-01T17:20:00.000+0000|
|off_median|0.3       |2023-03-01T17:30:00.000+0000|
|hpc_max   |0.54      |2023-03-01T17:30:00.000+0000|
|asset_min |0.32      |2023-03-01T17:35:00.000+0000|
|off_median|0.67      |2023-03-01T17:20:00.000+0000|
|asset_min |0.54      |2023-03-01T17:30:00.000+0000|
|off_median|0.32      |2023-03-01T17:35:00.000+0000|
|hpc_max   |0.67      |2023-03-01T17:35:00.000+0000|
+----------+----------+----------------------------+

Try this:

from pyspark.sql.functions import col, collect_list, when

# Rank each row by the desired position of its item_name:
# hpc_max -> 1, asset_min -> 2, off_median -> 3
df_2 = df_1.withColumn(
    "item_index",
    when(col("item_name") == "hpc_max", 1)
    .when(col("item_name") == "asset_min", 2)
    .otherwise(3),
)

# Sort so that collect_list picks the values up in the right order
df_3 = df_2.orderBy("timestamp", "item_index")

# Collapse each timestamp's rows into parallel arrays
df_3.groupBy("timestamp").agg(
    collect_list("item_name").alias("item_name"),
    collect_list("item_value").alias("item_value"),
).show(truncate=False)

Output

+----------------------------+--------------------------------+------------------+
|timestamp                   |item_name                       |item_value        |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[hpc_max, asset_min, off_median]|[0.67, 0.32, 0.32]|
|2023-03-01T17:30:00.000+0000|[hpc_max, asset_min, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[hpc_max, asset_min, off_median]|[0.25, 0.34, 0.67]|
+----------------------------+--------------------------------+------------------+
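
Since the order comes from a list, you can also derive item_index from the list itself instead of hard-coding the when/otherwise chain. A minimal sketch, assuming a SparkSession named spark is available (item_order and order_df are just illustrative names):

item_order = ["hpc_max", "asset_min", "off_median"]

# Small lookup DataFrame mapping each item name to its
# 1-based position in the list
order_df = spark.createDataFrame(
    [(name, i) for i, name in enumerate(item_order, start=1)],
    ["item_name", "item_index"],
)

# Join it in; every row now carries its desired position
df_2 = df_1.join(order_df, on="item_name", how="left")

This way, reordering or extending the list only means editing item_order.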
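
One caveat: Spark does not strictly guarantee that collect_list preserves the row order from a preceding orderBy once the data is shuffled for the groupBy. A more defensive sketch collects structs and sorts them inside each group (result is an illustrative name):

from pyspark.sql import functions as F

# Collect (item_index, item_name, item_value) structs per timestamp,
# sort each group's array by item_index (the first struct field),
# then split the sorted fields back into parallel arrays
result = (
    df_2.groupBy("timestamp")
    .agg(
        F.sort_array(
            F.collect_list(F.struct("item_index", "item_name", "item_value"))
        ).alias("items")
    )
    .select(
        "timestamp",
        F.col("items.item_name").alias("item_name"),
        F.col("items.item_value").alias("item_value"),
    )
)
result.show(truncate=False)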

Upvotes: 0
