Reputation: 208
I have a list of item names like below
["hpc_max", "asset_min", "off_median"]
and a table like this
| item_name | item_value | timestamp |
| --------- | ----------| -----------------------------|
| hpc_max | 0.25 | 2023-03-01T17:20:00.000+0000 |
| asset_min | 0.34 | 2023-03-01T17:20:00.000+0000 |
| off_median| 0.30 | 2023-03-01T17:30:00.000+0000 |
| hpc_max | 0.54 | 2023-03-01T17:30:00.000+0000 |
| asset_min | 0.32 | 2023-03-01T17:35:00.000+0000 |
| off_median| 0.67 | 2023-03-01T17:20:00.000+0000 |
| asset_min | 0.54 | 2023-03-01T17:30:00.000+0000 |
| off_median| 0.32 | 2023-03-01T17:35:00.000+0000 |
| hpc_max | 0.67 | 2023-03-01T17:35:00.000+0000 |
and I want to group the item_value entries by timestamp into arrays, with the values ordered according to the item names in the list.
The output I want is
| item_name | item_value | timestamp |
| --------- | ---------- | --------- |
| ["hpc_max","asset_min","off_median"] | [0.25,0.34,0.67] | 2023-03-01T17:20:00.000+0000 |
| ["hpc_max","asset_min","off_median"] | [0.54,0.54,0.30] | 2023-03-01T17:30:00.000+0000 |
| ["hpc_max","asset_min","off_median"] | [0.67,0.32,0.32] | 2023-03-01T17:35:00.000+0000 |
How can I do this using PySpark?
Any help would be appreciated!
Upvotes: 0
Views: 76
Reputation: 953
Your DataFrame (df_1):
+----------+----------+----------------------------+
|item_name |item_value|timestamp |
+----------+----------+----------------------------+
|hpc_max |0.25 |2023-03-01T17:20:00.000+0000|
|asset_min |0.34 |2023-03-01T17:20:00.000+0000|
|off_median|0.3 |2023-03-01T17:30:00.000+0000|
|hpc_max |0.54 |2023-03-01T17:30:00.000+0000|
|asset_min |0.32 |2023-03-01T17:35:00.000+0000|
|off_median|0.67 |2023-03-01T17:20:00.000+0000|
|asset_min |0.54 |2023-03-01T17:30:00.000+0000|
|off_median|0.32 |2023-03-01T17:35:00.000+0000|
|hpc_max |0.67 |2023-03-01T17:35:00.000+0000|
+----------+----------+----------------------------+
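For reference, df_1 can be reconstructed with this minimal sketch (assuming an existing SparkSession named spark; the timestamp is kept as a plain string here, matching the display above):

# Sample data copied from the question; timestamps kept as strings for simplicity
data = [
    ("hpc_max",    0.25, "2023-03-01T17:20:00.000+0000"),
    ("asset_min",  0.34, "2023-03-01T17:20:00.000+0000"),
    ("off_median", 0.30, "2023-03-01T17:30:00.000+0000"),
    ("hpc_max",    0.54, "2023-03-01T17:30:00.000+0000"),
    ("asset_min",  0.32, "2023-03-01T17:35:00.000+0000"),
    ("off_median", 0.67, "2023-03-01T17:20:00.000+0000"),
    ("asset_min",  0.54, "2023-03-01T17:30:00.000+0000"),
    ("off_median", 0.32, "2023-03-01T17:35:00.000+0000"),
    ("hpc_max",    0.67, "2023-03-01T17:35:00.000+0000"),
]
df_1 = spark.createDataFrame(data, ["item_name", "item_value", "timestamp"])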
Try this:

from pyspark.sql.functions import col, when, collect_list

# Assign each item name its position in the desired order
df_2 = df_1.withColumn(
    "item_index",
    when(col("item_name") == "hpc_max", 1)
    .when(col("item_name") == "asset_min", 2)
    .otherwise(3)
)

# Sort so the rows for each timestamp appear in that order before aggregating
df_3 = df_2.orderBy("timestamp", "item_index")

df_3.groupBy("timestamp").agg(
    collect_list("item_name").alias("item_name"),
    collect_list("item_value").alias("item_value")
).show(truncate=False)
Output
+----------------------------+--------------------------------+------------------+
|timestamp |item_name |item_value |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[hpc_max, asset_min, off_median]|[0.67, 0.32, 0.32]|
|2023-03-01T17:30:00.000+0000|[hpc_max, asset_min, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[hpc_max, asset_min, off_median]|[0.25, 0.34, 0.67]|
+----------------------------+--------------------------------+------------------+
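One caveat: Spark does not formally guarantee that collect_list preserves the row order produced by a prior orderBy once the aggregation shuffles data, even though it usually works in practice. A sketch of a variant that makes the ordering explicit, and that derives the index from the list itself instead of hard-coding the when() branches (assuming the list is available as a Python variable named items):

from pyspark.sql import functions as F

items = ["hpc_max", "asset_min", "off_median"]  # desired ordering

# 1-based position of item_name within `items`, built from the list
position_expr = "array_position(array({}), item_name)".format(
    ", ".join("'{}'".format(name) for name in items)
)

result = (
    df_1
    .withColumn("item_index", F.expr(position_expr))
    .groupBy("timestamp")
    .agg(
        # collect (index, name, value) structs, then sort by the leading index field
        F.sort_array(
            F.collect_list(F.struct("item_index", "item_name", "item_value"))
        ).alias("collected")
    )
    # extracting a field from an array of structs yields an array of that field
    .select(
        F.col("collected.item_name").alias("item_name"),
        F.col("collected.item_value").alias("item_value"),
        "timestamp",
    )
    .orderBy("timestamp")
)
result.show(truncate=False)

Here sort_array orders the collected structs by item_index (their first field), so each array comes out in the order given by items regardless of how the rows were shuffled.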
Upvotes: 0